This is a free open-access R reference book for applied epidemiologists and public health practitioners.
This book strives to:
What gaps does this book address?
How is this different than other R books?
If you have suggestions or want to contribute content, please post an issue or submit a pull request to this GitHub repository.
This handbook is a collaborative team production. It has been conceived, written, and edited by epidemiologists and public health practitioners from around the world, who have drawn upon their experiences within a constellation of organizations including local/state/provincial/national health departments and ministries, the World Health Organization (WHO), Médecins Sans Frontières (MSF / Doctors Without Borders), UNHCR, WFP, hospital systems, and academic institutions.
Here are the team members:
Editor-in-Chief: Neale Batra
Editorial core team: Alex Spina, Amrish Baidjoe, Henry Laurenson-Schafer, Finlay Campbell, Pat Keating
Authors (in order of contributions): Neale Batra, Alex Spina, Paula Blomquist, Finlay Campbell, Henry Laurenson-Schafer, Isaac Florence, Natalie Fischer, Daniel Molling, Liza Coyer, Jonny Polonski, Yurie Izawa, Sara Hollis
Reviewers: …(list)…
Advisers …(list)…
The handbook received funding via a COVID-19 emergency capacity-building grant from Training Programs in Epidemiology and Public Health Interventions Network (TEPHINET).
Programmatic support was provided by the EPIET Alumni Network (EAN).
The multitude of tutorials and vignettes that provided foundational knowledge for development of handbook content are credited within their respective pages.
More generally, the following sources provided inspiration and laid the groundwork for this handbook:
The “R4Epis” project (a collaboration between MSF and RECON)
R Epidemics Consortium (RECON)
R for Data Science book (R4DS)
bookdown: Authoring Books and Technical Documents with R Markdown
Netlify hosts this website
Logo: CDC Public Health Image library, R Graph Gallery
2013 Yemen looking for mosquito breeding sites
Ebola virus
Survey in Rajasthan
Network
This handbook is not an approved product of any specific organization.
Although we strive for accuracy, we provide no guarantee of the content in this book.
This book is licensed under a Creative Commons license TBD…
Package and function names
Package names are written in bold (e.g. dplyr) and functions are written like this: mutate(). Functions are sometimes referenced together with their package, either in text or within code, like this: dplyr::mutate()
Types of notes
NOTE: This is a note
TIP: This is a tip.
CAUTION: This is a cautionary note.
DANGER: This is a warning.
This handbook generally uses tidyverse R coding style. Read more here
We chose to frequently write code on new lines, in order to offer more understandable comments. As a result, code that could be written like this:
obs %>%
  group_by(name) %>%                         # group the rows by 'name'
  slice_max(date, n = 1, with_ties = FALSE)  # if there's a tie (of date), take the first row
…is often written like this:
obs %>%
  group_by(name) %>%     # group the rows by 'name'
  slice_max(
    date,                # keep row per group with maximum date value
    n = 1,               # keep only the single highest row
    with_ties = FALSE)   # if there's a tie (of date), take the first row
Below, we track significant editorial decisions around package and function choice. If you disagree or want to offer a new tool, please join/start a conversation on our GitHub page.
Table of package, function, and other editorial decisions
| Subject | Considered | Outcome & date | Brief rationale |
|---|---|---|---|
| Epiweeks | aweek, lubridate | lubridate, Dec 2020 | consistency, package maintenance prospects |
Data used in this handbook are either simulated or publicly available. All can be accessed from the “data” folder of our GitHub repository.
The case linelist used throughout much of the handbook is a simulated Ebola outbreak dataset from the outbreaks package.
This page is not intended to be a comprehensive “learn R” tutorial. However, it does cover some fundamentals that can be useful for reference or for refreshing your memory. See the section on recommended training for more comprehensive tutorials.
As stated on the R project website, R is a programming language and environment for statistical computing and graphics. It is highly versatile, extensible, and community-driven.
Cost
R is free to use! There is a strong ethic in the community of free and open-source material.
Reproducibility
Conducting your data management and analysis through a programming language (compared to Excel or another primarily point-click/manual tool) enhances reproducibility, makes error-detection easier, and eases your workload.
Community
The R community of users is enormous and collaborative. New packages and tools to address real-life problems are developed daily, and vetted by the community of users. As one example, R-Ladies is a worldwide organization whose mission is to promote gender diversity in the R community, and is one of the largest organizations of R users. It likely has a chapter near you!
How to install R
Visit this website https://www.r-project.org/ and download the latest version of R suitable for your computer.
How to install RStudio
Visit this website https://rstudio.com/products/rstudio/download/ and download the latest free Desktop version of RStudio suitable for your computer.
How to update R and RStudio
Your version of R is printed to the R Console at start-up. You can also run sessionInfo().
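As a small illustration of the version check described above, the following sketch uses only base R. The minimum version used in the comparison is an arbitrary example value, not a requirement of this handbook.

```r
# The version string that is also printed in the Console at start-up
r_version <- R.version.string
print(r_version)

# getRversion() returns a version object that can be compared,
# which is handy when a script needs a minimum R version
has_minimum <- getRversion() >= "3.5.0"   # "3.5.0" is an arbitrary example threshold
print(has_minimum)
```

sessionInfo() prints the same version along with the operating system and the packages currently loaded.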
To update R, go to the website mentioned above and re-install R. Alternatively, you can use the installr package (on Windows) by running installr::updateR(). This will open dialog boxes to help you download the latest R version and update your packages to the new R version. More details can be found in the installr documentation.
Be aware that the old R version will still exist in your computer. You can temporarily run an older version (older “installation”) of R by clicking “Tools” -> “Global Options” in RStudio and choosing an R version. This can be useful if you want to use a package that has not been updated to work on the newest version of R.
To update RStudio, you can go to the website above and re-download RStudio. Another option is to click “Help” -> “Check for Updates” within RStudio, but this may not show the very latest updates.
TinyTeX is a custom LaTeX distribution, useful when trying to produce PDFs from R.
See https://yihui.org/tinytex/ for more information.
To install TinyTeX from R:
install.packages('tinytex')
tinytex::install_tinytex()
# to uninstall TinyTeX, run tinytex::uninstall_tinytex()
Pandoc is a document converter, a piece of software separate from R. It comes bundled with RStudio and should not need to be downloaded. It helps the process of converting R Markdown documents to formats like .pdf and adding complex functionality.
RTools is a collection of software for building packages for R
Install from this website: https://cran.r-project.org/bin/windows/Rtools/
The webshot package is often used to take “screenshots” of webpages. For example, when you make a transmission chain with the epicontacts package, an interactive and dynamic HTML file is produced. If you want a static image, it can be useful to use the webshot package to automate this process. This requires the external program “phantomjs”, which you can install via the webshot package with the command webshot::install_phantomjs().
First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.
For RStudio to function you must also have R installed on the computer (see this section for installation instructions).
RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!
By default RStudio displays four rectangular panes.
TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.
The R Console Pane
The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where commands are actually run and where non-graphic outputs and error/warning messages appear. You can directly enter and run commands in the R Console, but be aware that these commands are not saved as they are when you run commands from a script.
If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.
The Source Pane
This pane, by default in the upper-left, is space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.
For Stata users, this pane is similar to your Do-file and Data Editor windows.
The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.
In Stata, this is most similar to the Variables Manager window.
This pane also contains History, where you can see commands that you ran previously. It also has a “Tutorial” tab where you can complete interactive R tutorials if you have the learnr package installed.
Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).
This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.
Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options. There you can change the default settings, including appearance/background color.
Scripts are a fundamental part of programming. Storing your code in a script (vs. typing in the console) has many advantages:
Rmarkdown is a type of script in which the script itself becomes a document (PDF, Word, HTML, Powerpoint, etc.). See the handbook page on R Markdown documents.
There is no difference between writing in a Rmarkdown vs an R notebook. However the execution of the document differs slightly. See this site for more details.
Shiny apps/websites are contained within one script, which must be named app.R. This file has three components:
shinyApp function
See the handbook page on Shiny and dashboards, or this online tutorial: Shiny tutorial
In the past, the above file was split into two files (ui.R and server.R).
The working directory is the root folder location used by R for your work - where R looks for and saves files by default.
By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.
The working directory appears in grey text at the top of the RStudio Console pane. You can also return the current working directory with getwd() (do not put anything in the parentheses).
NOTE: If using an R project, the working directory will default to the R project root folder IF you open RStudio by opening the R project (the file with the .Rproj extension).
Perhaps the most common source of frustration for an R beginner on a Windows machine is typing in a filepath to import data.
Use here - Avoid these problems altogether by using relative pathways from the root of an R project that uses the here package. See the here tab in this Basics page for more details.
Slash direction - If typing in a filepath, beware the direction of the slashes. Enter them using forward slashes to separate components (“data/provincial.csv”). For Windows users, filepaths are displayed and copied with backslashes (“\”) by default, so you must change the direction of each slash. Or just use here and an R project as noted above.
Avoid using “absolute” paths - these are “full address” paths that direct to the same place regardless of the user’s working directory. For example, avoid this:
C:/Users/Name/Document/Analytic Software/R/Projects/Analysis2019/data/March2019.csv
This path above could break if the script is sent to someone on another computer! Instead, consider using an R project and having the filepath begin at that root directory (i.e. the working directory of the R project).
One possible exception is if you work in a larger organization where you need to pull data from across several networked drives and don’t have permission to re-save the data in your R project. Absolute filepaths remain fragile in this situation, but they may be the best option.
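The slash-direction advice can be sketched in base R as follows. The Windows path here is hypothetical and exists only for illustration. file.path() builds a path from its components, and gsub() swaps backslashes in a copied path for forward slashes.

```r
# Build a relative path from components; file.path() joins them with forward slashes
rel_path <- file.path("data", "provincial.csv")
print(rel_path)

# A path copied from Windows uses backslashes.
# Inside an R string, each backslash must itself be escaped as "\\".
win_path <- "C:\\Users\\Name\\data\\provincial.csv"   # hypothetical path

# Swap every backslash for a forward slash
# (fixed = TRUE treats the pattern as a literal backslash, not a regular expression)
fixed_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(fixed_path)
```

Pasting a raw Windows path with single backslashes directly into quotation marks will usually produce an error, because R reads each backslash as the start of an escape sequence.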
Use the command setwd() with the filepath in quotations, for example: setwd("C:/Documents/R Files")
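A minimal sketch of changing and then restoring the working directory with base R is shown below. tempdir() is used only so the example runs on any computer; in practice you would supply your own folder.

```r
# Remember the current working directory so it can be restored later
original_wd <- getwd()

# setwd() changes the working directory and invisibly returns the previous one
previous_wd <- setwd(tempdir())
print(getwd())        # now the temporary folder

# Restore the original working directory
setwd(original_wd)
print(getwd())
```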
CAUTION: If using an RMarkdown script be aware of the following:
In an R Markdown script, the default working directory is the folder the Rmarkdown file (.Rmd) is saved to. If you want to change this, you can use setwd() as above, but know the change will only apply to that specific code chunk.
To change the working directory for all code chunks in an R markdown, edit the setup chunk to add the root.dir = parameter, such as below:
knitr::opts_knit$set(root.dir = 'desired/filepath/here')
Setting your working directory manually (point-and-click)
From RStudio click: Session / Set Working Directory / Choose Directory (you will have to do this each time you open RStudio)
If you are working in an R project, your working directory will by default be the root folder. The here package lets you take full advantage of this.
Everything in R is an object. These sections will explain:
Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object, which is assigned a name and can be referenced in later commands.
An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.
Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:
object_name <- value (or process/calculation that produces a value)
EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object current_week is created when it is assigned the character value "2018-W10" (the quote marks make this a character value).
The object current_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.
See the R commands and their output in the boxes below.
current_week <- "2018-W10" # this command creates the object current_week by assigning it a value
current_week # this command prints the current value of current_week object in the console
## [1] "2018-W10"
NOTE: The [1] in the R console output simply indicates that you are viewing the first item of the output.
CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.
The following command will re-define the value of current_week:
current_week <- "2018-W51" # assigns a NEW value to the object current_week
current_week # prints the current value of current_week in the console
## [1] "2018-W51"
Dataset
Datasets are also objects (typically “dataframes”) and must be assigned names when they are imported. In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package and its import() function.
# linelist is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")
You can read more about importing and exporting datasets with the section on importing data.
CAUTION: A quick note on naming of objects:
Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.
The graphic below, sourced from this online R tutorial shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.
In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:
| Common structure | Explanation | Example |
|---|---|---|
| Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the column age_years). |
| Data Frames | Vectors (e.g. columns) that are bound together that all have the same number of rows. | linelist is a data frame. |
Note that to create a vector that “stands alone”, or is not part of a data frame (such as a list of location names), the function c() is often used:
list_of_names <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")
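To connect the two structures in the table above, here is a short sketch that binds stand-alone vectors into a data frame. The case counts and object names are made up for illustration.

```r
# Two stand-alone vectors of the same length (the counts are made-up values)
district <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")
cases    <- c(12, 7, 30, 4)

length(district)        # number of elements in the vector

# Bind the vectors together as the columns of a data frame
district_counts <- data.frame(district = district, cases = cases)

nrow(district_counts)   # one row per element of the vectors
ncol(district_counts)   # one column per vector
str(district_counts)    # compact overview of each column and its class
```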
All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:
| Class | Explanation | Examples |
|---|---|---|
| Character | These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks” |
| Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000 |
| Numeric | These are numbers and can include decimals. If within quotation marks they will be considered character. | 23.1 or 14 |
| Factor | These are vectors that have a specified order or hierarchy of values | Variable msf_involvement with ordered values N, S, SUB, and U. |
| Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980 |
| Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE |
| data.frame | A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each. |
| list | A list is like a vector, but it can hold objects of different classes. | A list could hold a single number, and a dataframe, and a vector, and even another list within it! |
You can test the class of an object by feeding it to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.
class(linelist$age) # class should be numeric
## [1] "numeric"
class(linelist$gender) # class should be character
## [1] "character"
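class() can also be tried on stand-alone values, without any dataset. The sketch below builds one made-up example for each class in the table and asks R for its class.

```r
# One example value per class (all values are illustrative)
examples <- list(
  character  = "Kigali",
  integer    = 14L,                        # the L suffix marks a whole number
  numeric    = 23.1,
  logical    = TRUE,
  date       = as.Date("2018-04-12"),
  factor     = factor(c("N", "S", "U")),
  list       = list(1, "a", TRUE),
  data_frame = data.frame(x = 1:3)
)

# Apply class() to each example and collect the results
example_classes <- sapply(examples, class)
print(example_classes)
```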
Sometimes, a column will be converted to a different class automatically by R. Watch out for this! For example, if you have a vector or column of numbers, but a character value is substituted in… the entire column will change to class character.
One common example is when working with a dataframe to print a table: if you make a total row and try to paste/glue together percents in the same cell as numbers, the entire column above them will convert to character and can no longer be used for mathematical calculations.
num_vector <- c(1,2,3,4,5) # define vector as all numbers
class(num_vector) # vector is numeric class
## [1] "numeric"
num_vector[3] <- "three" # convert the third element to a character
class(num_vector) # vector is now character class
## [1] "character"
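The table scenario can be sketched in the same way. The counts below are invented, and keeping a separate numeric column is only one possible way to avoid the problem.

```r
counts <- c(12, 30, 58)     # made-up counts, class numeric
class(counts)

# Pasting a percent onto each count produces text, not numbers
labels <- paste0(counts, " (", round(100 * counts / sum(counts)), "%)")
print(labels)
class(labels)               # character: math is no longer possible on these

# One option: keep the numbers in their own column next to the display labels
tab <- data.frame(n = counts, label = labels)
total_n <- sum(tab$n)       # the numeric column can still be summed
print(total_n)
```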
Sometimes, you will need to convert objects or columns to another class.
| Function | Action |
|---|---|
| as.character() | Converts to character class |
| as.numeric() | Converts to numeric class |
| as.integer() | Converts to integer class |
| as.Date() | Converts to Date class - Note: see section on dates for details |
| as.factor() | Converts to factor - Note: re-defining order of value levels requires extra arguments |
Likewise, there are base R functions to check whether an object IS of a specific class, such as is.numeric(), is.character(), is.double(), is.factor(), and is.integer().
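The conversion and checking functions can be tried on small made-up values, as in the sketch below. It uses base R factor() with its levels = argument as one way to set the order of factor levels. That choice is an illustration, not necessarily the approach used elsewhere in this handbook.

```r
age_text <- c("23", "14.5", "unknown")   # numbers stored as text (made-up values)
is.numeric(age_text)                     # FALSE: these are character

# Text that cannot be read as a number becomes NA (R also gives a warning)
age_num <- suppressWarnings(as.numeric(age_text))
print(age_num)
is.numeric(age_num)                      # TRUE

whole <- as.integer(14.9)                # decimals are dropped, not rounded
print(whole)
as.character(2000)                       # the number becomes text

# as.Date() needs to be told the format when text is not year-month-day
onset <- as.Date("15/3/1954", format = "%d/%m/%Y")
print(onset)

# factor() with levels = sets the order of the values explicitly
severity <- factor(c("low", "high", "medium"),
                   levels = c("low", "medium", "high"))
levels(severity)
```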
Here is more online material on classes and data structures in R.
A column in a dataframe is technically a “vector”, or a sequence of values that must all be the same class (either character, numeric, logical, etc).
Columns within a data frame can be called, referenced, extracted, and created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. In this handbook, we try to use the word “column” instead of “variable”.
# Retrieve the length of the vector age
length(linelist$age) # (age is a column in the linelist data frame)
By typing the name of the dataframe followed by $ you will also see a drop-down menu of all columns in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!
ADVANCED TIP: Some more complex objects (e.g. a list, or an epicontacts object) may have multiple levels which can be accessed through multiple dollar signs. For example epicontacts$linelist$date_onset
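The uses of $ described in this section (extracting a column, creating a column, and stepping through nested levels) can be sketched with a tiny made-up data frame. The object and column names below are hypothetical.

```r
# A tiny made-up data frame standing in for a linelist
demo <- data.frame(
  case_id = c("a1", "b2", "c3"),
  age     = c(4, 31, 67)
)

demo$age                           # extract the age column as a vector
demo$age_months <- demo$age * 12   # create a new column from an existing one
names(demo)                        # the new column now appears

# Nested objects: each $ steps one level deeper
outbreak <- list(linelist = demo, location = "Kigali")
second_age <- outbreak$linelist$age[2]
print(second_age)
```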
You may need to view parts of objects, also called “indexing”, which is often done using the square brackets [ ]. Using $ on a dataframe to access a column is also a type of indexing.
my_vector <- c("a", "b", "c", "d", "e", "f") # define the vector
my_vector[5] # print the 5th element
## [1] "e"
Square brackets also work to return specific parts of a returned output, such as the output of a summary() function:
# All of the summary
summary(linelist$age)
##           Min.        1st Qu.         Median           Mean        3rd Qu.           Max.           NA's
##  0.00000000000  6.00000000000 13.00000000000 16.11189655172 23.00000000000 77.00000000000             88
# Just one part of the summary
summary(linelist$age)[2]
## 1st Qu.
##       6
To view specific rows and columns of a dataset, you can do this using the syntax dataframe[rows, columns]:
# View a specific row (2) from dataset, with all columns (don't forget the comma!)
linelist[2,]
# View all rows, but just one column
linelist[, "date_onset"]
# View values from row 2 and columns 5 through 10
linelist[2, 5:10]
# View values from row 2 and columns 5 through 10 and 18
linelist[2, c(5:10, 18)]
# View rows 2 through 20, and specific columns
linelist[2:20, c("date_onset", "outcome", "age")]
# View rows and columns based on criteria
# *** Note the dataframe must still be named in the criteria!
# *** Rows where age is NA are returned as all-NA rows; wrap the criteria in which() to drop them
linelist[linelist$age > 25 , c("date_onset", "date_birth", "age")]
# Use View() to see the outputs in the RStudio Viewer pane (easier to read)
# *** Note the capital "V" in View() function
View(linelist[2:20, "date_onset"])
# Save as a new object
new_table <- linelist[2:20, c("date_onset")]
When indexing an object of class list, single brackets always return an object of class list, even if only a single object is returned. Double brackets, however, can be used to access a single element and return a different class than list.
Brackets can also be written after one another, as demonstrated below.
This visual explanation of list indexing, using pepper shakers, is humorous and helpful.
# define demo list
my_list <- list(
  # First element in the list is a character vector
  hospitals = c("Central", "Empire", "Santa Anna"),
  # second element in the list is a dataframe of addresses
  address = data.frame(
    street = c("145 Medical Way", "1048 Brown Ave", "999 El Camino"),
    city = c("Andover", "Hamilton", "El Paso")
  )
)
Now we extract, using various methods:
my_list[1] # this returns the element in class "list"
## $hospitals
## [1] "Central" "Empire" "Santa Anna"
my_list[[1]] # this is a character vector
## [1] "Central" "Empire" "Santa Anna"
my_list[["hospitals"]] # you can also index by name of the list element
## [1] "Central" "Empire" "Santa Anna"
my_list[[1]][3] # this returns the third element of the "hospitals" character vector
## [1] "Santa Anna"
my_list[[2]][1] # This returns the first column ("street") of the address dataframe
##            street
## 1 145 Medical Way
## 2  1048 Brown Ave
## 3   999 El Camino
You can remove individual objects by putting the name in the rm() function (no quote marks):
rm(object_name)
You can remove all objects (clear your workspace) by running:
rm(list = ls(all = TRUE))
This section on functions explains:
A function is like a machine that receives inputs, does some action with those inputs, and produces an output. What the output is depends on the function.
Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:
sqrt(49)
## [1] 7
The object provided to a function also can be a column in a dataset. For example, when the function summary() is applied to the numeric column age in the dataset linelist, the output is a summary of the column’s numeric and missing values.
summary(linelist$age)
##           Min.        1st Qu.         Median           Mean        3rd Qu.           Max.           NA's
##  0.00000000000  6.00000000000 13.00000000000 16.11189655172 23.00000000000 77.00000000000             88
NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.
Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.
Here is a fictional function, called oven_bake(), as an example of a typical function. It takes an input object (e.g. a dataset, or in this example “dough”) and performs operations on it as specified by additional arguments (minutes = and temperature =). The output can be printed to the console, or saved as an object using the assignment operator <-.
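To make the fictional oven_bake() concrete, here is one hypothetical way it could be written. The function body, the default values, and the text it returns are all invented for illustration. The point is only how arguments, defaults, and assignment of the output fit together.

```r
# Hypothetical function, written only to illustrate arguments and defaults
oven_bake <- function(dough, minutes = 35, temperature = 375) {
  paste0(dough, " baked for ", minutes, " minutes at ", temperature, " degrees")
}

# Named arguments; the output is saved as an object with <-
cookies <- oven_bake(dough = "cookie dough", minutes = 12, temperature = 350)
print(cookies)

# Unnamed arguments are matched by position (dough, minutes, temperature)
same_cookies <- oven_bake("cookie dough", 12, 350)

# Named arguments can appear in any order; omitted ones use their defaults
bread <- oven_bake(temperature = 450, dough = "bread dough")
print(bread)
```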
For example, the age_pyramid() command below produces an age pyramid plot based on defined age groups and a binary split column, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the dataframe to use, age_cat5 as the column to count, and gender as the binary column to use for splitting the pyramid by color.
# Create an age pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_cat5", split_by = "gender")
The above command can be equivalently written as below, with newlines. This can be easier to read, and makes it easier to write # comments. To run this command you can highlight the entire command, or just place your cursor in the first line and then press the Ctrl and Enter keys simultaneously.
# Create an age pyramid
apyramid::age_pyramid(
  data = linelist,        # case linelist
  age_group = "age_cat5", # age group column
  split_by = "gender"     # two sides to pyramid
)
The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.
# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_cat5", "gender")
A more complex age_pyramid() command might include the optional arguments to:
Show percents instead of counts (set proportional = TRUE when the default is FALSE)
Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)
NOTE: For arguments that you specify with both parts of the argument (e.g. proportional = TRUE), their order among all the arguments does not matter.
apyramid::age_pyramid(
  linelist,                   # use case linelist
  "age_cat5",                 # age group column
  "gender",                   # split by gender
  proportional = TRUE,        # percents instead of counts
  pal = c("orange", "purple") # colors
)
Packages contain functions.
An R package is a shareable bundle of code and documentation that contains pre-defined functions. Users in the R community develop and share packages all the time, so chances are good that a solution exists for you! You will install and use hundreds of packages in your use of R.
On installation, R contains “base” packages and functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use. In this handbook, package names are written in bold. One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.
Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command at the beginning of each R session.
Think of R as your personal library: When you download a package, your library gains a new book of functions, but each time you want to use a function in that book, you must borrow that book from your library.
Your library is a folder on your computer with all your packages. Find out where R is installed in your computer, and look for a folder called “win-library”. For example: R\win-library\4.0 (the 4.0 is the R version - you’ll have a different library for each R version you’ve downloaded).
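Rather than searching your computer for the library folder by hand, you can ask R where it is. This sketch uses only base R. The folders and packages printed will differ from one computer to another.

```r
# Folders where R looks for installed packages
# (the first one is where new packages are installed by default)
library_folders <- .libPaths()
print(library_folders)

# Names of the packages installed in those folders
installed <- rownames(installed.packages())
head(installed)
```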
CRAN
CRAN (Comprehensive R Archive Network) is a public warehouse of R packages that have been published by R community members. Most often, R users download packages from CRAN.
Install vs. Load
To use a package, 2 steps must be implemented:
The basic function for installing a package is install.packages(), where the name of the package is provided in quotes. This can also be accomplished point-and-click by going to the RStudio “Packages” pane and clicking “Install” and typing the package name. Note all this is case-sensitive.
install.packages("tidyverse")
The basic function to load a package for use (after it has been installed) is library(), with the name of the package NOT in quotes.
library(tidyverse)
To check whether a package is installed or loaded, you can view the Packages pane in RStudio. If the package is installed, it is shown there with version number. If the box is checked, it is loaded for the current session.
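The same check can be done in code instead of through the Packages pane. The sketch below uses the stats package only because it ships with R, so the example runs anywhere. Substitute any package name.

```r
pkg <- "stats"   # stats ships with R, so this example runs on any installation

# Is the package installed? requireNamespace() returns TRUE or FALSE
# and does not attach the package to the session
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Is the package attached (as library() would do) in this session?
is_attached <- paste0("package:", pkg) %in% search()

# Which version is installed?
pkg_version <- as.character(packageVersion(pkg))

print(c(installed = is_installed, attached = is_attached))
print(pkg_version)
```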
Using pacman
This handbook uses the package pacman (abbreviation for “package manager”), which offers the useful function p_load(). This function combines the above two steps into one - it installs and/or loads packages, depending on what is needed. If the package has not yet been installed, it will attempt to install from CRAN, and then load it.
Below, we load three of the packages often used in this R basics page:
pacman::p_load(tidyverse, rio, here)
Install from GitHub
Sometimes, you need to install the development version of a package from a GitHub repository. You can use p_load_gh() from pacman (this function is a “wrapper” around install_github() from the devtools package, meaning that it uses that function internally).
The first name listed in the quotation marks is the Github ID of the repository owner, and after the slash is the name of the repository. If you want to install from a branch other than the main/master branch, add it after an “@”.
# install development version of package from github repository
pacman::p_install_gh("reconhub/epicontacts")
# load development version of package which you had downloaded from github repository
pacman::p_load_gh("reconhub/epicontacts")
# install development version of package, but not the main branch
pacman::p_install_gh("reconhub/epicontacts@timeline")
Read more about pacman in this online vignette
Install from ZIP or TAR
You could get the package from a URL:
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
Or download it to your computer in a zipped file:
Option 1:
library(devtools)
install_local("~/Downloads/dplyr-master.zip")
Option 2:
install.packages(filepath_to_source, repos = NULL, type="source")
# like this:
install.packages("~/Downloads/dplyr-master.zip", repos=NULL, type="source")
You can update packages by re-installing them. You can also click the green “Update” button in your RStudio Packages pane to see which packages have new versions to install. Be aware that your old code may need to be updated if there is a major revision to how a function works!
Use p_delete() from pacman, or remove.packages() from base R. Alternatively, go find the folder which contains your library and manually delete the folder.
Packages often depend on other packages to work. These are called dependencies. If a dependency fails to install, then the package depending on it may also fail to install.
See the dependencies of a package with p_depends(), and see which packages depend on it with p_depends_reverse()
For clarity in this handbook, functions are usually preceded by the name of their package using the :: symbol in the following way: package_name::function_name()
Once a package is loaded for a session, this explicit style is not necessary. One can just use function_name(). However, writing the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()). Writing the package name also lets you use a function from an installed package without first loading the package with library().
# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")
It is not uncommon that two or more packages contain the same function name. For example, the package dplyr has a filter() function, but so does the package stats. The default filter() function depends on the order in which these packages are loaded in the R session: the one loaded later will be the default for the command filter().
You can check the order in the Environment pane of RStudio - click the drop-down for “Global Environment” and see the order of the packages. You can also run search() in the console. Functions from packages that appear earlier in that list (those loaded more recently) will mask functions of the same name in packages that appear later. When first loading a package, R will warn you in the console if masking is occurring, but this can be easy to miss.
Here are ways you can fix masking:
Specify the package name in the command, for example dplyr::filter()
Re-arrange the order in which the packages are loaded (e.g. within p_load()), and start a new R session
To detach (unload) a package, use this command, with the correct package name:
detach(package:PACKAGE_NAME_HERE, unload=TRUE)
Note that this may not resolve masking.
See this guide
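As a small illustration of masking that needs no extra packages, the sketch below deliberately defines a function called sd() in the global environment, which then hides the sd() function from the stats package. The object names are illustrative only.

```r
# Define our own sd() in the current environment; it now masks stats::sd()
sd <- function(x) "my own sd()"

masked_result   <- sd(c(1, 2, 3))         # calls our function
explicit_result <- stats::sd(c(1, 2, 3))  # calls the version in the stats package

# Remove our version so that sd() refers to stats::sd() again
rm(sd)
restored_result <- sd(c(1, 2, 3))

masked_result
explicit_result
restored_result
```

Writing stats::sd() gives the same result whether or not the name is masked, which is why the explicit package_name::function_name() style is a dependable fix.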
To read more about a function, you can search for it in the Help tab of the lower-right RStudio. You can also run a command like ?thefunctionname (put the name of the function after a question mark) and the Help page will appear in the Help pane. Finally, try searching online for resources.
Piping (%>%)
Two general approaches to working with objects are:
Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.
Read more on this approach in the tidyverse style guide
Here is a fake example for comparison, using fictional functions to “bake a cake”. First, the pipe method:
# A fake example of how to bake a cake using piping syntax
cake <- flour %>%       # to define cake, start with flour, and then...
  left_join(eggs) %>%   # add eggs
  left_join(oil) %>%    # add oil
  left_join(water) %>%  # add water
  mix_together(         # mix together
    utensil = spoon,
    minutes = 2) %>%
  bake(degrees = 350,   # bake
       system = "fahrenheit",
       minutes = 35) %>%
  let_cool()            # let it cool down
Here is another link describing the utility of pipes.
The %>% pipe is not part of base R. To use it, the magrittr package must be installed and loaded (this is typically done by loading the tidyverse or dplyr package). You can read more about piping in the magrittr documentation.
CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.
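The short sketch below, which assumes only that the magrittr package is installed, shows the difference between piping with and without assignment. The vector ages and the object age_roots are made up for illustration.

```r
library(magrittr)  # provides %>%

ages <- c(4, 9, 16)

# Piping without assignment: the result is printed, but 'ages' is unchanged
ages %>% sqrt() %>% sum()

# Piping with assignment: the result is saved in a new object
age_roots <- ages %>% sqrt()

# Assigning back to the same name over-writes the original object
ages <- ages %>% sqrt()
```

After the last command, ages no longer holds the original values, so assign back to the same name only when that is what you intend.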
%<>%
This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, so object %<>% function() %>% function() is the same as object <- object %>% function() %>% function().
This approach to changing objects/dataframes may be better if:
Risks:
Either name each intermediate object, or overwrite the original, or combine all the functions together. All come with their own risks.
Below is the same fake “cake” example as above, but using this style:
# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)
batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)
cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)
cake <- let_cool(cake)
Combine all functions together - this is difficult to read:
# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))
This section details operators in R, such as:
%in% operator
<-
The basic assignment operator in R is <-, used as object_name <- value.
This assignment operator can also be written as =. We advise use of <- for general R use.
We also advise surrounding such operators with spaces, for readability.
<<-
If writing functions, or using R in an interactive way with sourced scripts, then you may need to use the assignment operator <<- (from base R). This operator is used to define an object in a higher ‘parent’ R Environment. See this online reference.
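A minimal base R sketch of the difference between <- and <<- inside a function follows. The names case_count, add_case() and add_case_local() are invented for this illustration.

```r
case_count <- 0

add_case <- function() {
  # <<- modifies 'case_count' in the enclosing (parent) environment
  case_count <<- case_count + 1
}

add_case_local <- function() {
  # <- only creates a local copy that disappears when the function ends
  case_count <- case_count + 100
}

add_case()
add_case()
add_case_local()

case_count   # 2: only the two add_case() calls changed the outer object
```

Because <<- changes objects outside the function, it can make scripts harder to follow, so it is usually reserved for cases where that side effect is intended.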
%<>%
This is an “assignment pipe” from the magrittr package, which pipes an object forward and also re-defines the object. It must be the first pipe operator in the chain. It is shorthand, as shown below in two equivalent examples:
linelist <- linelist %>%
  mutate(age_months = age_years * 12)
The above is equivalent to the below:
linelist %<>% mutate(age_months = age_years * 12)
%<+%
This is used to add data to phylogenetic trees with the ggtree package. See the page on Phylogenetic trees or this online resource book.
Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:
| Function | Operator | Example | Example Result |
|---|---|---|---|
| Equal to | == | "A" == "a" | FALSE (because R is case sensitive). Note that == (double equals) is different from = (single equals), which acts like the assignment operator <- |
| Not equal to | != | 2 != 0 | TRUE |
| Greater than | > | 4 > 2 | TRUE |
| Less than | < | 4 < 2 | FALSE |
| Greater than or equal to | >= | 6 >= 4 | TRUE |
| Less than or equal to | <= | 6 <= 4 | FALSE |
| Value is missing | is.na() | is.na(7) | FALSE (see page on Missing data) |
| Value is not missing | !is.na() | !is.na(7) | TRUE |
Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.
| Function | Operator |
|---|---|
| AND | & |
| OR | \| (vertical bar) |
| Parentheses | ( ) Used to group criteria together and clarify order of operations |
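As a quick illustration of these operators on vectors, here is a sketch in base R. The age and sex values are invented.

```r
age <- c(5, 17, 30, 64)
sex <- c("f", "m", "f", "m")

girls        <- age < 18 & sex == "f"   # both conditions must be TRUE
young_or_old <- age < 18 | age > 60     # at least one condition must be TRUE

# & is evaluated before |, so parentheses can change the result
without_parentheses <- TRUE | FALSE & FALSE    # same as TRUE | (FALSE & FALSE)
with_parentheses    <- (TRUE | FALSE) & FALSE

girls
young_or_old
without_parentheses
with_parentheses
```

When a statement mixes & and |, adding parentheses makes the intended grouping explicit to the reader as well as to R.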
For example, below, we have a linelist with two variables that we want to use to create our case definition: rdt_result, a test result, and other_cases_in_home, which tells us whether there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:
linelist_cleaned <- linelist %>%
  mutate(case_def = case_when(
    is.na(rdt_result) & is.na(other_cases_in_home) ~ NA_character_,
    rdt_result == "Positive" ~ "Confirmed",
    rdt_result != "Positive" & other_cases_in_home == "Yes" ~ "Probable",
    TRUE ~ "Suspected"
  ))
| Criteria in example above | Resulting value in new variable “case_def” |
|---|---|
| If the values for both rdt_result and other_cases_in_home are missing | NA (missing) |
| If the value in rdt_result is “Positive” | “Confirmed” |
| If the value in rdt_result is NOT “Positive” AND the value in other_cases_in_home is “Yes” | “Probable” |
| If none of the above criteria are met | “Suspected” |
Note that R is case-sensitive, so “Positive” is different than “positive”…
In R, missing values are represented by the special value NA (a “reserved” value) (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA.
To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.
rdt_result <- c("Positive", "Suspected", "Positive", NA) # two positive cases, one suspected, and one unknown
is.na(rdt_result) # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE TRUE
Here is the R documentation on missing values
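To make the re-coding step concrete, here is a minimal base R sketch. It assumes a hypothetical dataset in which the number 99 was used to record an unknown age.

```r
# 99 was used in the raw data to mean "age unknown" (hypothetical coding)
age <- c(34, 99, 12, 99)

# Re-code 99 to NA
age[age == 99] <- NA

n_missing <- sum(is.na(age))          # count the missing values
mean_age  <- mean(age, na.rm = TRUE)  # mean of the non-missing values

age
n_missing
mean_age
```

Without the re-coding step, the two values of 99 would be treated as real ages and would inflate the mean.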
Variations on NA
NA is actually a logical value of length 1. You may also encounter NA_character_, NA_real_, NA_complex_, and NA_integer_, which correspond to specific classes.
The most prominent application of one of these variants in common epidemiology work is using case_when(). The Right-Hand Side (RHS) values must all be of the same class. Thus, if you have character outcomes on the RHS like “Confirmed”, “Suspected”, “Probable” and a plain NA, you will get an error in older versions of dplyr (before 1.1.0). Instead of NA, use NA_character_. Likewise for integers, use NA_integer_.
NULL
NULL is the null object in R, often used to represent a list of 0 length. Use is.null() to evaluate this status.
More detail on the difference between NA and NULL is here
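The behaviour described above can be checked directly in the console. This is a base R sketch, and the object names with_na and with_null are illustrative.

```r
length(NA)      # 1: NA is a value that takes up a position
length(NULL)    # 0: NULL is the absence of an object

with_na   <- c(1, NA, 3)     # NA is kept as a placeholder
with_null <- c(1, NULL, 3)   # NULL is dropped

length(with_na)    # 3
length(with_null)  # 2

class(NA)             # "logical"
class(NA_character_)  # "character"
class(NA_integer_)    # "integer"
```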
All the operators and functions in this page are automatically available using base R.
These are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.
| Objective | Example in R |
|---|---|
| addition | 2 + 3 |
| subtraction | 2 - 3 |
| multiplication | 2 * 3 |
| division | 30 / 5 |
| exponent | 2^3 |
| order of operations | ( ) |
| Objective | Function |
|---|---|
| rounding | round(x, digits = n) |
| rounding | janitor::round_half_up(x, digits = n) |
| ceiling (round up) | ceiling(x) |
| floor (round down) | floor(x) |
| absolute value | abs(x) |
| square root | sqrt(x) |
| exponential (e to the power of x) | exp(x) |
| natural logarithm | log(x) |
| log base 10 | log10(x) |
| log base 2 | log2(x) |
DANGER: round() uses “banker’s rounding” which rounds up from a .5 only if the upper number is even. Use round_half_up() from janitor to consistently round halves up to the nearest whole number. See this explanation
# use the appropriate rounding function for your work
round(c(2.5, 3.5))
## [1] 2 4
janitor::round_half_up(c(2.5, 3.5))
## [1] 3 4
CAUTION: The functions below will by default include missing values in calculations. Missing values will result in an output of NA, unless the argument na.rm=TRUE is specified
| Objective | Function |
|---|---|
| mean (average) | mean(x, na.rm=T) |
| median | median(x, na.rm=T) |
| standard deviation | sd(x, na.rm=T) |
| quantiles* | quantile(x, probs) |
| sum | sum(x, na.rm=T) |
| minimum value | min(x, na.rm=T) |
| maximum value | max(x, na.rm=T) |
| range of numeric values | range(x, na.rm=T) |
| summary** | summary(x) |
Notes:
* quantile(): x is the numeric vector to examine, and probs = is a numeric vector of probabilities between 0 and 1.0, e.g. c(0.5, 0.8, 0.85)
** summary(): gives a summary of a numeric vector including mean, median, and common percentiles
DANGER: If providing a vector of numbers to one of the above functions, be sure to wrap the numbers within c().
# If supplying raw numbers to a function, wrap them in c()
mean(1, 6, 12, 10, 5, 0) # !!! INCORRECT !!!
## [1] 1
mean(c(1, 6, 12, 10, 5, 0)) # CORRECT
## [1] 5.666666666666667
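To illustrate the earlier caution about missing values and the note on quantile(), here is a base R sketch that reuses the same numbers with one missing value added. The vector name ages is illustrative.

```r
ages <- c(1, 6, 12, 10, 5, 0, NA)

mean_default <- mean(ages)                # NA, because one value is missing
mean_no_na   <- mean(ages, na.rm = TRUE)  # the missing value is removed first

# 25th, 50th and 75th percentiles of the non-missing values
quartiles <- quantile(ages, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

mean_default
mean_no_na
quartiles
```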
| Objective | Function | Example |
|---|---|---|
| create a sequence | seq(from, to, by) | seq(1, 10, 2) |
| repeat x, n times | rep(x, ntimes) | rep(1:3, 2) or rep(c("a", "b", "c"), 3) |
| subdivide a numeric vector | cut(x, n) | cut(linelist$age, 5) |
| take a random sample | sample(x, size) | sample(linelist$id, size = 5, replace = TRUE) |
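The functions in this table can be tried on small made-up vectors, as in the base R sketch below. Note that cut() is given explicit break points here rather than a number of intervals, and set.seed() is added only so that the random sample is reproducible.

```r
seq(1, 10, 2)   # 1 3 5 7 9
rep(1:3, 2)     # 1 2 3 1 2 3

ages <- c(1, 25, 50, 99)
# Intervals are (0,18], (18,65] and (65,120]
age_groups <- cut(ages, breaks = c(0, 18, 65, 120))
table(age_groups)

set.seed(123)   # makes the random sample reproducible
ids <- c("A1", "A2", "A3", "A4", "A5")
sampled_ids <- sample(ids, size = 3)   # sampling without replacement by default
sampled_ids
```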
%in%
A very useful operator for matching values, and for quickly assessing if a value is within a vector or dataframe.
my_vector <- c("a", "b", "c", "d")
"a" %in% my_vector
## [1] TRUE
"h" %in% my_vector
## [1] FALSE
To ask if a value is not %in% a vector, put an exclamation mark (!) in front of the logic statement:
# to negate, put an exclamation in front
!"a" %in% my_vector
## [1] FALSE
!"h" %in% my_vector
## [1] TRUE
%in% is very useful when using the dplyr function case_when(). You can define a vector previously, and then reference it later. For example:
affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")
linelist <- linelist %>%
  mutate(child_hospitalized = case_when(
    hospitalized %in% affirmative & age < 18 ~ "Hospitalized Child",
    TRUE ~ "Not"))
Note: If you want to detect a partial string, perhaps using str_detect() from stringr, it will not accept a character vector like c("1", "Yes", "yes", "y"). Instead, it must be given a regular expression - one condensed string with OR bars, such as “1|Yes|yes|y”. For example, str_detect(hospitalized, "1|Yes|yes|y"). See the page on Characters and strings for more information.
You can convert a character vector to a single regular expression string with this command:
affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")
affirmative
## [1] "1" "Yes" "YES" "yes" "y" "Y" "oui" "Oui" "Si"
# condense to
affirmative_str_search <- paste0(affirmative, collapse = "|") # option with base R
affirmative_str_search <- str_c(affirmative, collapse = "|") # option with stringr package
affirmative_str_search
## [1] "1|Yes|YES|yes|y|Y|oui|Oui|Si"
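The sketch below contrasts exact matching with %in% against partial matching with a condensed search string. It uses grepl() from base R as a stand-in for str_detect(), so that no extra package is needed. The responses vector is invented.

```r
affirmative <- c("1", "Yes", "YES", "yes", "y", "Y", "oui", "Oui", "Si")
affirmative_str_search <- paste0(affirmative, collapse = "|")

responses <- c("Yes", "no", "oui", "maybe")

# Exact matching: the whole value must be one of the listed values
exact_match <- responses %in% affirmative

# Partial matching: the pattern may appear anywhere within the value
partial_match <- grepl(affirmative_str_search, responses)

exact_match
partial_match
```

Note that “maybe” is a partial match because it contains the letter “y”, so a search string built this way can catch more values than intended.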
This section explains:
Common errors and warnings and troubleshooting tips can be found in the page on [Errors and warnings].
When a command is run, the R Console may show you warning or error messages in red text.
A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.
An error means that R was not able to complete your command.
Look for clues:
The error/warning message will often include a line number for the problem.
If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.
If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!
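The difference between a warning and an error can be seen in a small base R sketch. The handlers below are included only to capture the messages for display, and the object names converted and outcome are illustrative.

```r
# A warning: the command completes, but R flags something unusual
converted <- withCallingHandlers(
  as.numeric(c("1", "two", "3")),
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
converted   # a result is still returned, with NA where conversion failed

# An error: the command cannot complete, so no result is produced
outcome <- tryCatch(
  log("a"),
  error = function(e) paste("Error caught:", conditionMessage(e))
)
outcome
```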
A few things to remember when writing commands in R, to avoid errors and warnings:
Variable_A is different from variable_A
Any script (RMarkdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the right side of the script, to warn you.
Help documentation
Search the RStudio “Help” tab for documentation on packages and specific functions. This is within the pane that also contains Files, Plots, and Packages (typically in the lower-right pane). As a shortcut, you can also type the name of a package or function into the R console after a question-mark to open the relevant Help page. For example: ?filter or ?DiagrammeR.
Interactive tutorials
RStudio has built-in interactive tutorials via the learnr package. If this package is installed, you can go through these tutorials via the “Tutorial” tab in the upper-right RStudio pane (which also contains Environment and History tabs).
This is an online R resource specifically for Excel users
There are many cheatsheets available on the RStudio website, for example:
A definitive text is the R for Data Science book by Garrett Grolemund and Hadley Wickham
Here we describe ways to import and export data:
The package we recommend for importing data is: rio. rio utilizes the file extension (e.g. .xlsx, .csv, .rds, etc.) to import or export the file correctly.
The alternative to using rio would be to use functions from many other packages that are specific to a type of file (e.g. read.csv(), read.xlsx(), etc.). These alternatives can be difficult to remember, whereas using import() from rio is relatively easy.
Below is a table, taken from the rio online vignette. It shows for each type of data: the file extension that is expected, the packages it uses to import or export the data (so you can look up specific arguments, if needed), and whether this functionality is included in the default installed version of rio.
| Format | Typical Extension | Import Package | Export Package | Installed by Default |
|---|---|---|---|---|
| Comma-separated data | .csv | data.table | data.table | Yes |
| Pipe-separated data | .psv | data.table | data.table | Yes |
| Tab-separated data | .tsv | data.table | data.table | Yes |
| SAS | .sas7bdat | haven | haven | Yes |
| SPSS | .sav | haven | haven | Yes |
| Stata | .dta | haven | haven | Yes |
| SAS XPORT | .xpt | haven | haven | Yes |
| SPSS Portable | .por | haven | | Yes |
| Excel | .xls | readxl | | Yes |
| Excel | .xlsx | readxl | openxlsx | Yes |
| R syntax | .R | base | base | Yes |
| Saved R objects | .RData, .rda | base | base | Yes |
| Serialized R objects | .rds | base | base | Yes |
| Epiinfo | .rec | foreign | | Yes |
| Minitab | .mtp | foreign | | Yes |
| Systat | .syd | foreign | | Yes |
| “XBASE” database files | .dbf | foreign | foreign | Yes |
| Weka Attribute-Relation File Format | .arff | foreign | foreign | Yes |
| Data Interchange Format | .dif | utils | | Yes |
| Fortran data | no recognized extension | utils | | Yes |
| Fixed-width format data | .fwf | utils | utils | Yes |
| gzip comma-separated data | .csv.gz | utils | utils | Yes |
| CSVY (CSV + YAML metadata header) | .csvy | csvy | csvy | No |
| EViews | .wf1 | hexView | | No |
| Feather R/Python interchange format | .feather | feather | feather | No |
| Fast Storage | .fst | fst | fst | No |
| JSON | .json | jsonlite | jsonlite | No |
| Matlab | .mat | rmatio | rmatio | No |
| OpenDocument Spreadsheet | .ods | readODS | readODS | No |
| HTML Tables | .html | xml2 | xml2 | No |
| Shallow XML documents | .xml | xml2 | xml2 | No |
| YAML | .yml | yaml | yaml | No |
| Clipboard | default is tsv | clipr | clipr | No |
You can read more about the rio package in this online vignette
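As a minimal sketch of how rio chooses the format from the file extension, the example below writes a small made-up data frame to a temporary .csv file and reads it back. It assumes the rio package is installed. The object names demo_data, csv_path and reimported are illustrative.

```r
library(rio)

demo_data <- data.frame(case_id = c("A1", "A2"), age = c(34, 7))

# The extension of the file path tells rio which format to write and read
csv_path <- tempfile(fileext = ".csv")
export(demo_data, csv_path)
reimported <- import(csv_path)

reimported
```

Changing the extension in tempfile(fileext = ...) to another format that rio supports by default, such as ".rds", should be the only change needed to use that format instead.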
Relative filepaths (here())
Relative filepaths differ from absolute filepaths in that they are defined relative to a specific directory location.
For example:
import("C:/Users/nsbatra/My Documents/R files/epiproject/data/linelists/ebola_linelist.xlsx")
import(here("data", "linelists", "ebola_linelist.xlsx"))
The package here can be used, often in conjunction with rio for importing or exporting. here locates files on your computer via relative pathways, usually within the context of R projects. Relative pathways are relative from a designated folder location, so that pathways listed in R code will not break when the script is run on a different computer.
This code chunk shows the loading of packages for importing data.
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(rio, here)
Use the package here and its function here() to implement relative pathways.
here() works best within R projects. When the here package is loaded, it looks for the root-level folder of your R project (for example, the folder that contains the .Rproj file) and uses that folder as a “benchmark” or “anchor” for all other files in the project.
Thus, in your script, if you want to import or reference a file saved in your R project’s folders, you use the function here() to tell R where the file is in relation to that benchmark. These relative filepaths can be used for both importing and exporting/saving data.
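One way to use such a filepath defensively is sketched below. It assumes the here package is installed, and the folder name “data” and file name “linelist.xlsx” are hypothetical. Checking that the file exists first gives a clearer message than a failed import.

```r
library(here)

# Build a filepath relative to the project root (folder and file names are hypothetical)
linelist_path <- here("data", "linelist.xlsx")

# Check that the file exists before trying to import it
if (file.exists(linelist_path)) {
  message("File found: ", linelist_path)
} else {
  message("No file at: ", linelist_path)
}
```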
If you are unsure where “here” is set to, run the function here() with the empty brackets:
# This command tells you the folder path that "here" is set to
here::here()
Below is an example of importing the file “linelist_raw.xlsx” which is located in the benchmark “here” folder. All you have to do is provide the name of the file in quotes (with the appropriate ending).
linelist <- import(here("linelist_raw.xlsx"))
If the file is within a subfolder - let’s say a “data” folder - write these folder names in quotes, separated by commas, as below:
linelist <- import(here("data", "linelist.xlsx"))
Using the here() command produces a character filepath, which is then processed by the import() function.
# the filepath
here("data", "linelist.xlsx")
## [1] "C:/Users/Neale/OneDrive - Neale Batra/Documents/Analytic Software/R/Projects/R handbook/Epi_R_handbook/data/linelist.xlsx"
# the filepath is given to the import() function
linelist <- import(here("data", "linelist.xlsx"))
When you import a dataset, you are doing the following:
The function import() (from the package rio) accepts a filepath within quotation marks. A few things to note:
With absolute filepath:
# Absolute pathway
p_load(rio)
my_data <- import("C:/Users/Timothy/Documents/cancer project/data/clean/survival_data.xlsx")
With relative filepath:
# Relative pathway
p_load(rio, here)
my_data <- import(here("data", "clean", "survival_data.xlsx"))
If importing a specific sheet from an Excel file, include the sheet name in the which = argument of import(). For example:
my_data <- rio::import("my_excel_file.xlsx", which = "Sheetname")
If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parenthesis of the here() function.
# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelists", "linelist.xlsx"), which = "Sheet1")
You can import data manually via one of these methods:
Use file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# Manual selection of a file.
# When this command is run, a POP-UP window should appear.
# The filepath of the selected file will be supplied to the import() command.
my_data <- import(file.choose())
TIP: The pop-up window may appear BEHIND your RStudio window.
You can import data from an online Google spreadsheet with the googlesheets4 package and by authenticating your access to the spreadsheet.
pacman::p_load("googlesheets4")
Below, a demo Google sheet is imported and saved. This command may prompt confirmation of authentication of your Google account. Follow prompts and pop-ups in your internet browser to grant Tidyverse API packages permissions to edit, create, and delete your spreadsheets in Google Drive.
The sheet below is “viewable for anyone with the link” and you can try to import it.
Gsheets_demo <- read_sheet("https://docs.google.com/spreadsheets/d/1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY/edit#gid=0")
The sheet can also be imported using only the sheet ID, a shorter part of the URL:
Gsheets_demo <- read_sheet("1scgtzkVLLHAe5a6_eFQEwkZcc14yFUx1KgOMZ4AKUfY")
Another package, googledrive, offers useful functions for managing files in Google Drive, including Google sheets. To create and write Google sheets from R, see the gs4_create() and sheet_write() functions in googlesheets4.
Here are some other helpful online tutorials: basic importing tutorial more detail interaction between the two packages
Scraping data from a website - TBD - Under construction
Sometimes, you may want to avoid importing a row of data. You can do this with the argument skip = if using import() from rio on a .xlsx or .csv file. Provide the number of rows you want to skip.
linelist_raw <- import("linelist_raw.xlsx", skip = 1) # does not import header row
Unfortunately skip = only accepts one integer value, not a range (e.g. “2:10” does not work). To skip import of specific rows that are not consecutive from the top, consider importing multiple times and using bind_rows() from dplyr. See the example below of skipping only row 2.
Your data may have a second row of data, for example if it is a “data dictionary” row (see example below).
This situation can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.
The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names.
For Excel files:
# import first time; store the column names
linelist_raw_names <- import("linelist_raw.xlsx") %>% names() # save true column names
# import second time; skip the first two rows (header and dictionary), and assign column names to argument col_names =
linelist_raw <- import("linelist_raw.xlsx",
                       skip = 2,
                       col_names = linelist_raw_names
                       )
For CSV files:
# import first time; store column names
linelist_raw_names <- import("linelist_raw.csv") %>% names() # save true column names
# note argument for csv files is 'col.names = '
linelist_raw <- import("linelist_raw.csv",
                       skip = 2,
                       col.names = linelist_raw_names
                       )
Backup option - changing column names as a separate command
# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names
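The two-pass approach can be tried end to end with the self-contained sketch below. It swaps in read.csv() from base R in place of rio's import(), and writes a small made-up CSV file (with a data dictionary in its second row) to a temporary location. The file contents and object names are invented for illustration.

```r
# Create a small demo file: header row, dictionary row, then two data rows
demo_path <- tempfile(fileext = ".csv")
writeLines(c("case_id,age,outcome",
             "Unique ID,Age in years,Final status",
             "A1,34,Recovered",
             "A2,7,Died"),
           demo_path)

# First pass: read only enough of the file to capture the true column names
demo_names <- names(read.csv(demo_path, nrows = 1))

# Second pass: skip the header and dictionary rows, then apply the saved names
linelist_demo <- read.csv(demo_path,
                          skip = 2,
                          header = FALSE,
                          col.names = demo_names)

str(linelist_demo)
```

Because the dictionary row is skipped, the age column is read as numeric rather than character.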
Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it. This tip is adapted from this post.
dict <- linelist_2headers %>%             # begin: linelist with dictionary as first row
  head(1) %>%                             # keep only column names and first dictionary row
  pivot_longer(cols = everything(),       # pivot all columns to long format
               names_to = "Column",       # assign new column names
               values_to = "Description")
In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row. Replace “df” with the name of your dataset.
names(df) <- paste(names(df), df[1, ], sep = "_")
Since a data frame is a combination of vertical vectors (columns), R by default expects manual entry of data to also be in vertical vectors (columns).
# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death <- c(1, 0, 1, 0)
CAUTION: All vectors must be the same length (same number of values).
The vectors can then be bound together using the function data.frame():
# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)
And now we display the new dataset:
Use the tribble() function from the tibble package from the tidyverse (online tibble reference).
Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.).
You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. For example:
# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
~colA, ~colB,
"a", 1,
"b", 2,
"c", 3
)
And now we display the new dataset:
If you copy data from elsewhere and have it on your clipboard, you can try the following command to convert those data into an R data frame:
manual_entry_clipboard <- read.table(file = "clipboard",
                                     sep = "\t",      # separator could be tab, or commas, etc.
                                     header = TRUE)   # if there is a header row
.Rdata files store R objects, and can actually store multiple R objects within one file, for example multiple dataframes, model results, lists, etc. This can be very useful to consolidate or share your data.
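# export several objects to one file by passing them as a named list
rio::export(list(my_list = my_list, my_dataframe = my_dataframe, my_vector = my_vector), "my_objects.Rdata")
If you have a list and you want it to be imported with the original structure (e.g. list of lists), use import_list():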
rio::import_list("my_list_of_lists.Rdata")
With rio, you can use the export() function in a very similar way to import(). First give the name of the R object you want to save (e.g. linelist) and then, in quotes, the filepath including name and file extension. For example:
export(linelist, "my_linelist.xlsx") # will save to working directory
You could save the same dataframe as a .csv, and to a folder specified by a here() relative pathway:
export(linelist, here("data","clean", "my_linelist.csv"))
You can also export/save R dataframes as .rds files. The convenient thing about these is that classes of columns are retained, so you have less cleaning to do when importing than with an Excel or even a CSV file.
export(linelist, here("data","clean", "my_linelist.rds"))
How to save plots, such as those created by ggplot() is discussed in the ggplot tips page. In brief, run ggsave("my_plot_filepath_and_name.png") after printing your plot.
How to save a network graph, such as a transmission tree, is addressed in the page on Transmission chains.
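Returning to the point above that .rds files retain column classes, the sketch below compares a round trip through .rds with a round trip through .csv. It uses the base R functions saveRDS(), readRDS(), write.csv() and read.csv() rather than rio, and the small data frame is invented.

```r
demo_data <- data.frame(id = 1:2,
                        onset = as.Date(c("2020-10-03", "2020-10-05")))

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(demo_data, rds_path)
write.csv(demo_data, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

class(from_rds$onset)   # still class "Date"
class(from_csv$onset)   # no longer class "Date"; would need to be converted again
```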
An R project enables your work to be bundled in a self-contained, self-sufficient folder associated with a working directory. The project can include all the relevant data files, figures/outputs, scripts, and history, and can be shared.
To create an R project, select “New Project” from the File menu.
The R project you create will come in the form of a folder containing a .Rproj file. This file is a shortcut and likely the primary way you will open your project. You can also open a project by selecting “Open Project” from the File menu. Alternatively on the far upper right side of RStudio you will see an R project icon and a drop-down menu of available R projects.
To exit from an R project, either open a new project, or close the project (File - Close Project).
It is generally advised that you start RStudio each time with a “clean slate” - that is, with nothing preserved from a previous session or workspace. This will mean that your objects and results will not persist session-to-session (you must re-create them by running your scripts). However, this will force you to write better scripts and will avoid significant pain in the long run. To do this, do the following:
It is common to have subfolders in your project. Consider having folders such as “data”, “scripts”, “figures”, “presentations”… also consider a version control system. It could be something as simple as having dates on the names of scripts (e.g. “transmission_analysis_2020-10-03.R”) and having a header at the top of each script with a description, tags, change log, and author list.
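If you prefer to create the subfolders from R rather than by hand, one possible sketch in base R is shown below. A temporary folder stands in for the project root, and the folder names are simply the ones suggested above.

```r
# A temporary folder stands in for the project root in this sketch
project_root <- file.path(tempdir(), "epi_project_demo")

subfolders <- c("data", "scripts", "figures", "presentations")

for (folder in subfolders) {
  # recursive = TRUE also creates the project root if it does not yet exist
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

list.files(project_root)
```

Within a real R project, the subfolders could be created relative to the project root in the same way.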
One tip is that you can search across an entire project or folder using the “Find in Files” tool (Edit menu). It can search and even replace strings across multiple files.
RStudio page on using R projects
Below is a long list of suggested packages for common epidemiological work in R. You can copy this code and use # symbols to remove any packages you do not want.
This code uses the pacman package, which you can install with install.packages("pacman"). Also, consider using the package conflicted to manage conflicts and masking of functions.
# List of common & useful epidemiology R packages
pacman::p_load(
# learning R
learnr,
# project and file management
here,
rio,
# package install and management
pacman,
renv,
remotes,
# General data management
tidyverse,
#dplyr,
#tidyr,
#ggplot2,
linelist,
lubridate,
naniar,
# statistics
gtsummary,
# epidemic modeling
epicontacts,
# plots - general
#ggplot2, # included in tidyverse
cowplot, # combining plots
RColorBrewer, # color scales
# plots - specific types
DiagrammeR,
incidence,
# gis
sf, # to manage spatial data using a Simple Feature format
tmap, # to produce simple maps, works for both interactive and static maps
OpenStreetMap, # to add OSM basemap in ggplot map
# routine reports
rmarkdown, # produce PDFs, Word Documents, Powerpoints, and HTML files
reportfactory, # Auto-organization of Rmarkdown outputs
# tables
knitr,
DT,
# phylogenetics
ggtree,
ape,
# interactive
plotly,
shiny
)
This page demonstrates common steps necessary to clean a dataset, starting with importing raw data and demonstrating a “pipe chain” of cleaning steps.
This page uses a simulated Ebola case linelist, which is referenced throughout the handbook.
Here are some of the functions described in this page:
%>% - pipe to pass the dataset from one function to the next
mutate() - to create, transform, and re-define columns
select() - to select or re-name columns
rename() - to rename columns
across() - to transform multiple columns at one time
filter() - to keep certain rows
add_row() - to add row manually
clean_names() - to standardize the syntax of column names
as.character(), as.numeric(), as.Date(), etc. - to convert the class of a column
recode() - to re-code values in a column
case_when() - to re-code values in a column using more complex logical criteria
replace_na(), na_if(), coalesce() - special functions for re-coding
clean_data() - to re-code/clean using a data dictionary
age_categories() and cut() - to create categorical groups from a numeric column
distinct() - to de-duplicate rows
This page proceeds through typical cleaning steps, adding them sequentially to a cleaning pipe chain.
In epidemiological analysis and data processing, cleaning steps are often performed linked together, sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another.
The chain often utilizes dplyr verb functions and the magrittr pipe operator %>%. The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).
In a cleaning pipeline the order of the steps is important. Cleaning steps might include:
Below are the packages used in the page to clean the data:
pacman::p_load(
rio, # importing data
here, # relative file pathways
janitor, # data cleaning
lubridate, # working with dates
epikit, # age_categories() function
tidyverse # data manipulation and visualization
)
Here we import the raw .xlsx dataset using the import() function from the package rio, and save it as the dataframe linelist_raw. If your dataset is large and takes a long time to import, it can be useful to have the import command be separate from the pipe chain and the “raw” saved as a distinct file. This also allows easy comparison between the original and cleaned versions.
See the page on Importing and exporting data for more details and unusual situations, including:
linelist_raw <- import("linelist_raw.xlsx")
You can view the first 50 rows of the original “raw” dataset below:
You can use the package skimr and its function skim() to get an overview of the entire dataframe (see page on Descriptive analysis).
Scroll to the right to see that histograms of each numeric column are included.
skimr::skim(linelist_raw)
| Name | linelist_raw |
| Number of rows | 6479 |
| Number of columns | 28 |
| _______________________ | |
| Column type frequency: | |
| character | 17 |
| numeric | 8 |
| POSIXct | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| case_id | 4 | 1.00 | 6 | 6 | 0 | 5888 | 0 |
| date onset | 276 | 0.96 | 10 | 10 | 0 | 571 | 0 |
| outcome | 1449 | 0.78 | 5 | 7 | 0 | 2 | 0 |
| gender | 330 | 0.95 | 1 | 1 | 0 | 2 | 0 |
| hospital | 1472 | 0.77 | 5 | 36 | 0 | 13 | 0 |
| infector | 2291 | 0.65 | 6 | 6 | 0 | 2697 | 0 |
| source | 2291 | 0.65 | 5 | 7 | 0 | 2 | 0 |
| age | 105 | 0.98 | 1 | 2 | 0 | 75 | 0 |
| age_unit | 4 | 1.00 | 5 | 6 | 0 | 2 | 0 |
| fever | 239 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| chills | 239 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| cough | 239 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| aches | 239 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| vomit | 239 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| time_admission | 817 | 0.87 | 5 | 5 | 0 | 1092 | 0 |
| merged_header | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
| ...28 | 0 | 1.00 | 1 | 1 | 0 | 1 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| generation | 4 | 1.00 | 16.58 | 5.72 | 0.00 | 13.00 | 16.00 | 20.00 | 37.00 | ▁▆▇▂▁ |
| lon | 4 | 1.00 | -13.23 | 0.02 | -13.27 | -13.25 | -13.23 | -13.22 | -13.21 | ▅▃▃▅▇ |
| lat | 4 | 1.00 | 8.47 | 0.01 | 8.45 | 8.46 | 8.47 | 8.48 | 8.49 | ▅▇▇▇▆ |
| row_num | 0 | 1.00 | 3240.00 | 1870.47 | 1.00 | 1620.50 | 3240.00 | 4859.50 | 6479.00 | ▇▇▇▇▇ |
| wt_kg | 4 | 1.00 | 53.04 | 18.57 | -9.00 | 41.00 | 55.00 | 66.00 | 115.00 | ▁▃▇▅▁ |
| ht_cm | 4 | 1.00 | 125.31 | 49.48 | 7.00 | 91.00 | 130.00 | 159.00 | 292.00 | ▂▅▇▂▁ |
| ct_blood | 4 | 1.00 | 21.25 | 1.68 | 16.00 | 20.00 | 22.00 | 22.00 | 26.00 | ▁▃▇▃▁ |
| temp | 134 | 0.98 | 38.60 | 0.96 | 35.60 | 38.30 | 38.90 | 39.30 | 40.70 | ▁▂▃▇▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| infection date | 2290 | 0.65 | 2012-04-22 | 2015-04-27 | 2014-10-03 | 534 |
| hosp date | 4 | 1.00 | 2012-04-29 | 2015-04-30 | 2014-10-15 | 565 |
| date_of_outcome | 1033 | 0.84 | 2012-05-17 | 2015-06-04 | 2014-10-25 | 564 |
Column names are used very often, so they must have “clean” syntax. We suggest the following:
The column names of linelist_raw are printed below using names() from base R. We can see that:
names(linelist_raw)
## [1] "case_id" "generation" "infection date" "date onset" "hosp date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector" "source" "age" "age_unit" "row_num"
## [17] "wt_kg" "ht_cm" "ct_blood" "fever" "chills" "cough" "aches" "vomit"
## [25] "temp" "time_admission" "merged_header" "...28"
Note: For a column name that includes spaces, surround the name with back-ticks, for example: linelist$`infection date`. Note that on your keyboard, the back-tick (`) is different from the single quotation mark (’).
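A minimal sketch of the back-tick syntax, using a small made-up dataframe (not the handbook linelist). The argument check.names = FALSE is used here only so that base R keeps the space in the column name.

```r
# Toy dataframe; check.names = FALSE keeps the space in the column name
df <- data.frame(
  case_id = c("a1", "b2"),
  `infection date` = c("2014-05-08", "2014-05-13"),
  check.names = FALSE
)

# Back-ticks are required with $ because the name contains a space
onset_backtick <- df$`infection date`

# Double-bracket indexing with a quoted string is an alternative
onset_bracket <- df[["infection date"]]

names(df)
```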
The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:
The case = argument sets the capitalization style of the new names (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
The replace = argument accepts a named vector of specific replacements (e.g. replace = c(onset = "date_of_onset"))
Below, the cleaning pipeline begins by using clean_names() on the raw linelist.
# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "case_id" "generation" "infection_date" "date_onset" "hosp_date" "date_of_outcome" "outcome" "gender"
## [9] "hospital" "lon" "lat" "infector" "source" "age" "age_unit" "row_num"
## [17] "wt_kg" "ht_cm" "ct_blood" "fever" "chills" "cough" "aches" "vomit"
## [25] "temp" "time_admission" "merged_header" "x28"
NOTE: The last column name “...28” was changed to “x28”.
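The behaviour described above can be tried on a tiny made-up dataframe whose column names mimic three of the raw names. This is only a sketch: it assumes the janitor package is installed, and the data values are placeholders.

```r
library(janitor)

# Toy dataframe with names containing spaces and an auto-generated "...28"
df <- setNames(
  data.frame(1, 2, 3),
  c("infection date", "hosp date", "...28")
)

# Default behaviour: snake case
snake_names <- names(clean_names(df))
snake_names

# Alternative capitalization style via the case = argument
camel_names <- names(clean_names(df, case = "small_camel"))
camel_names
```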
Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”: the new column name is given before the old column name.
Below, a re-name command is added to the cleaning pipeline:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)
Now you can see that the column names have been changed:
## [1] "case_id" "generation" "date_infection" "date_onset" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "row_num" "wt_kg" "ht_cm"
## [19] "ct_blood" "fever" "chills" "cough" "aches" "vomit"
## [25] "temp" "time_admission" "merged_header" "x28"
You can also rename by column position, instead of column name, for example:
rename(newNameForFirstColumn = 1,
newNameForSecondColumn = 2)
select()
You can also rename columns within the dplyr select() function, which is used to retain only certain columns (and covered later in this page). This approach also uses the format new_name = old_name. Here is an example:
linelist_raw %>%
select(# NEW name # OLD name
date_infection = `infection date`, # rename and KEEP ONLY these columns
date_hospitalisation = `hosp date`)
If you are importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “...1” or “...2”. You can clean these names manually by referencing their position number (see example above), or their name (linelist_raw$`...1`).
Merged cells in an Excel file are a common occurrence when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accommodate merged cells.
Remind people doing data entry that human-readable data is not the same as machine-readable data. Strive to train users about the principles of tidy data. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.
When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.
One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.
linelist_raw <- openxlsx::readWorkbook("linelist_raw.xlsx", fillMergedCells = TRUE)
DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning.
Use select() to select the columns you want to retain, and their order in the dataframe.
CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.
Here are ALL the column names in the linelist:
names(linelist)
## [1] "case_id" "generation" "date_infection" "date_onset" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "row_num" "wt_kg" "ht_cm"
## [19] "ct_blood" "fever" "chills" "cough" "aches" "vomit"
## [25] "temp" "time_admission" "merged_header" "x28"
Select only the columns you want to remain
Put their names in the select() command, with no quotation marks. They will appear in the order you provide. Note that if you include a column that does not exist, R will return an error (see any_of below if you want no error in this situation).
# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names() # display the column names
## [1] "case_id" "date_onset" "date_hospitalisation" "fever"
Helper functions and operators exist to make it easy to specify columns.
For example, if you want to re-order the columns, everything() is useful to signify all other columns not yet mentioned. The command below pulls columns date_onset and date_hospitalisation to the beginning:
# move date_onset and date_hospitalisation to beginning
linelist %>%
select(date_onset, date_hospitalisation, everything()) %>%
names()
## [1] "date_onset" "date_hospitalisation" "case_id" "generation" "date_infection" "date_outcome"
## [7] "outcome" "gender" "hospital" "lon" "lat" "infector"
## [13] "source" "age" "age_unit" "row_num" "wt_kg" "ht_cm"
## [19] "ct_blood" "fever" "chills" "cough" "aches" "vomit"
## [25] "temp" "time_admission" "merged_header" "x28"
As well as everything(), here are other helper functions that work within select():
everything() - all other columns not mentioned
last_col() - the last column
where() - applies a function to all columns and selects those which are TRUE
starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
contains() - columns containing a character string. Example: select(contains("time"))
matches() - to apply a regular expression (regex). Example: select(matches("[pt]al"))
num_range() - a numerical range like x01, x02, x03
any_of() - matches IF column is named. Useful if the name might not exist. Example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))
In addition, use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.
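To see several of the helpers listed above side by side, here is a short sketch on a made-up five-column tibble (the column names echo the linelist, but the values are invented). Note that any_of() takes a character vector of names.

```r
library(dplyr)

# Toy dataset with invented values
df <- tibble(
  case_id      = c("a1", "b2"),
  date_onset   = as.Date(c("2014-05-08", "2014-05-13")),
  date_outcome = as.Date(c("2014-05-20", NA)),
  hospital     = c("Port Hospital", "Other"),
  age          = c(23, 41)
)

# Columns whose names start with "date"
date_cols <- df %>% select(starts_with("date")) %>% names()

# Regular expression: names containing "pal" or "tal"
regex_cols <- df %>% select(matches("[pt]al")) %>% names()

# any_of() silently skips names that do not exist (date_death here)
maybe_cols <- df %>% select(any_of(c("date_onset", "date_death"))) %>% names()

# Consecutive columns with : and class-based selection with where()
range_cols   <- df %>% select(case_id:date_outcome) %>% names()
numeric_cols <- df %>% select(where(is.numeric)) %>% names()
```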
Use where() to specify logical criteria for columns. If providing a function inside where(), do not include the empty parentheses. Below selects columns that are class Numeric.
# select columns that are class Numeric
linelist %>%
select(where(is.numeric)) %>%
names()
## [1] "generation" "lon" "lat" "row_num" "wt_kg" "ht_cm" "ct_blood" "temp"
Use contains() to select only columns in which the column name contains a string. ends_with() and starts_with() provide more nuance.
# select columns containing certain characters
linelist %>%
select(contains("date")) %>%
names()
## [1] "date_infection" "date_onset" "date_hospitalisation" "date_outcome"
The function matches() works similarly to contains() but can be provided a regular expression (see page on Characters and strings), such as multiple strings separated by OR bars within the parentheses:
# searched for multiple character matches
linelist %>%
select(matches("onset|hosp|fev")) %>% # note the OR symbol "|"
names()
## [1] "date_onset" "date_hospitalisation" "hospital" "fever"
CAUTION: If a column name that you specifically provide does not exist in the data, it can return an error and stop your code. Consider using any_of() to cite columns that may or may not exist, especially useful in negative (remove) selections.
Only one of these columns exists, but no error is produced and the code continues.
linelist %>%
select(any_of(c("date_onset", "village_origin", "village_detection", "village_residence", "village_travel"))) %>%
names()
## [1] "date_onset"
Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained.
linelist %>%
select(-c(date_onset, fever:vomit)) %>% # remove onset and all from fever to vomit
names()
## [1] "case_id" "generation" "date_infection" "date_hospitalisation" "date_outcome" "outcome"
## [7] "gender" "hospital" "lon" "lat" "infector" "source"
## [13] "age" "age_unit" "row_num" "wt_kg" "ht_cm" "ct_blood"
## [19] "temp" "time_admission" "merged_header" "x28"
select() can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.
# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))
# display the column names
names(linelist_age)
## [1] "case_id" "age" "age_unit"
In the linelist_raw, there are a few columns we do not need: row_num, merged_header, and x28. Remove them by adding a select() command to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
#####################################################
# remove column
select(-c(row_num, merged_header, x28))
See the handbook page on De-duplication. Only a very simple de-duplication example is presented here.
The package dplyr offers the distinct() function to reduce the dataframe to only unique rows - removing rows that are 100% duplicates. We just add the simple command distinct() to the pipe chain:
We begin with 6479 rows in linelist.
linelist <- linelist %>%
distinct()
After de-duplication there are 6479 rows. The row count is unchanged, which suggests that no rows in this dataset were 100% duplicates of other rows.
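To see distinct() actually drop rows, here is a small sketch on an invented four-row dataset containing one exact duplicate. The .keep_all = TRUE variant, which de-duplicates on selected columns only, is shown for comparison.

```r
library(dplyr)

# Toy data: rows 1 and 2 are identical; rows 3 and 4 share only case_id
df <- tibble(
  case_id = c("a1", "a1", "b2", "b2"),
  age     = c(23, 23, 41, 42)
)

# Remove rows that are 100% duplicates across all columns
deduped_all <- df %>% distinct()

# Keep the first row per case_id, retaining the other columns
deduped_id <- df %>% distinct(case_id, .keep_all = TRUE)

nrow(df)           # rows before
nrow(deduped_all)  # rows after full-row de-duplication
nrow(deduped_id)   # rows after de-duplication on case_id
```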
Below, the distinct() command is added to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_header, x28)) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
#####################################################
# de-duplicate
distinct()
The verb mutate() is used to add a new column, or to modify an existing one. Below is an example of creating a new column with mutate(). The syntax is: mutate(new_column_name = value or transformation)
The most basic mutate() command to create a new column might look like this. It creates a new column new_col where the value in every row is 10.
linelist <- linelist %>%
mutate(new_col = 10)
You can also reference values in other columns, to perform calculations. For example below the Body Mass Index (BMI) is calculated using the formula BMI = kg/m^2, using column ht_cm and column wt_kg.
linelist <- linelist %>%
mutate(bmi = wt_kg / (ht_cm/100)^2)
If creating multiple new columns, separate each with a comma and new line. Below are examples of ways to create new columns, including pasting together values from other columns using str_glue() from the stringr package:
linelist <- linelist %>%
mutate(
new_var_dup = case_id, # new column = duplicate/copy another existing column
new_var_static = 7, # new column = all values the same
new_var_static = new_var_static + 5, # you can overwrite a column, and it can be a calculation using other variables
new_var_paste = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
)
Scroll to the right to see the new columns (first 50 rows shown):
TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.
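A quick comparison of the two verbs on an invented two-row dataset, as a sketch only. The BMI formula is the same one used earlier on this page.

```r
library(dplyr)

# Toy data with invented weights and heights
df <- tibble(
  case_id = c("a1", "b2"),
  wt_kg   = c(60, 80),
  ht_cm   = c(150, 200)
)

# mutate() keeps all existing columns and adds bmi
with_mutate <- df %>%
  mutate(bmi = wt_kg / (ht_cm / 100)^2)

# transmute() keeps only the columns that are mentioned
with_transmute <- df %>%
  transmute(case_id, bmi = wt_kg / (ht_cm / 100)^2)

names(with_mutate)
names(with_transmute)
```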
Often you will need to set the correct class for a column. There are ways to set column class during the import commands, but this is often cumbersome. See section on object classes to learn more about converting the class of objects, including columns.
First, run some checks on important columns to see if they are the correct class:
Currently, the class of the “age” column is character. To perform quantitative analyses, we need these numbers to be recognized as numeric!
class(linelist$age)
## [1] "character"
The class of the “date_onset” column is also character! To perform analyses, these dates must be recognized as dates!
class(linelist$date_onset)
## [1] "character"
In this case, use mutate() to define the column as itself, but converted to a different class. Here is a basic example, converting or ensuring that the column age is class Numeric:
linelist <- linelist %>%
mutate(age = as.numeric(age))
Examples of other converting functions:
# Examples of modifying class
linelist <- linelist %>%
mutate(date_var = as.Date(date_var, format = "%m/%d/%Y"), # format codes for MM/DD/YYYY; see page on Dates for details
numeric_var = as.numeric(numeric_var),
character_var = as.character(character_var),
factor_var = factor(factor_var, levels = c(...), labels = c(...)) # See page on Factors for details
)
Dates can be especially difficult! The date values must all be in the same format for conversion to work correctly (e.g. “MM/DD/YYYY”, or “DD Mmm YYYY”). See the page on Working with Dates for details. Especially after converting to class date, check your data visually or with a cross-table to confirm that each value was converted correctly. For as.Date(), the format = argument is often a source of errors.
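To illustrate why the format = argument deserves care, the base R sketch below parses two invented date strings written as MM/DD/YYYY. With the matching format codes both values convert; with day and month swapped, one value is silently mis-read and the other becomes NA.

```r
# Invented raw values written as month/day/year
raw_dates <- c("10/03/2020", "12/25/2020")

# Correct format codes: %m = month, %d = day, %Y = 4-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Wrong format (day and month swapped)
parsed_wrong <- as.Date(raw_dates, format = "%d/%m/%Y")

parsed        # 3 October and 25 December 2020
parsed_wrong  # first value becomes 10 March 2020, second becomes NA

# Check how many values failed to convert
sum(is.na(parsed_wrong))
```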
If your dataframe is already grouped (see page on Grouping data), mutate() may behave differently than if the dataframe is not grouped. Any summarizing functions, like mean(), median(), max(), etc. will be based on only the grouped rows, not all the rows.
# age normalized to mean of ALL rows
linelist %>%
select(case_id, age, hospital) %>%
mutate(age_norm = age / mean(age, na.rm=T))
# age normalized to mean of hospital group
linelist %>%
select(case_id, age, hospital) %>%
group_by(hospital) %>%
mutate(age_norm = age / mean(age, na.rm=T))
Read more about using mutate on grouped dataframes in this tidyverse mutate documentation.
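The difference is easier to see with numbers. Below is a sketch with four invented rows in two hospitals: the ungrouped version divides by the overall mean age (35), while the grouped version divides by each hospital's own mean (20 and 50).

```r
library(dplyr)

# Toy data: two hospitals with different age profiles
df <- tibble(
  hospital = c("A", "A", "B", "B"),
  age      = c(10, 30, 40, 60)
)

# Normalized to the mean of ALL rows
ungrouped <- df %>%
  mutate(age_norm = age / mean(age, na.rm = TRUE))

# Normalized to the mean within each hospital
grouped <- df %>%
  group_by(hospital) %>%
  mutate(age_norm = age / mean(age, na.rm = TRUE)) %>%
  ungroup()   # remove grouping so later steps are not affected

ungrouped$age_norm
grouped$age_norm
```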
Often to write concise code you want to apply the same transformation to multiple columns at once. A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).
across() can be used with any dplyr verb, but is most commonly used with mutate(), filter(), or summarise().
across() allows you to specify which columns you want a function to apply to. To specify the columns, you can name them individually, or use helper functions.
Here the transformation as.character() is applied to specific columns named within across(). Note that functions in across() are written without their parentheses ( ).
linelist <- linelist %>%
mutate(across(c(temp, ht_cm, wt_kg), as.character))
There are helpers available to assist you in specifying columns:
everything() - all other columns not mentioned
last_col() - the last column
where() - applies a function to all columns and selects those which are TRUE
starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
contains() - columns containing a character string. Example: select(contains("time"))
matches() - to apply a regular expression (regex). Example: select(matches("[pt]al"))
num_range() - a numerical range like x01, x02, x03
any_of() - matches if column is named. Useful if the name might not exist. Example: any_of(c("date_onset", "date_death", "cardiac_arrest"))
Here is an example of how one would change all columns to character class:
#to change all columns to character class
linelist <- linelist %>%
mutate(across(everything(), as.character))
Columns where the name contains the string “date” (note placement of commas and parentheses):
#to change columns whose name contains "date" to character class
linelist <- linelist %>%
mutate(across(contains("date"), as.character))
Below, we want to mutate the columns that are class POSIXct (a datetime class that shows timestamps), that is, where the function is.POSIXct() evaluates to TRUE. Then we want to apply the function as.Date() to each of these columns to convert them to class Date.
linelist <- linelist %>%
mutate(across(where(lubridate::is.POSIXct), as.Date))
Note that within across() we also use the function where().
Note that is.POSIXct() is from the package lubridate, while other “is” functions (is.character(), is.numeric(), and is.logical()) are from base R.
Here are a few online resources on using across(): creator Hadley Wickham’s thoughts/rationale
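As a compact, self-contained sketch of across() with both named columns and where(), the example below uses an invented dataset in which numbers were imported as text and hospital names carry stray spaces. The choice of trimws() as the second transformation is ours, for illustration.

```r
library(dplyr)

# Toy data: numeric values stored as character, and untidy whitespace
df <- tibble(
  case_id  = c("a1", "b2"),
  hospital = c(" Port Hospital", "Other "),
  wt_kg    = c("60", "80"),
  ht_cm    = c("150", "200")
)

df_clean <- df %>%
  # convert specific named columns to numeric
  mutate(across(c(wt_kg, ht_cm), as.numeric)) %>%
  # trim whitespace in every column that is (still) class character
  mutate(across(where(is.character), trimws))

sapply(df_clean, class)
```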
coalesce()
This dplyr function finds the first non-missing value at each position.
Say you have two vectors, one for village of detection and another for village of residence. You can use coalesce to pick the first non-missing value for each index:
village_detection <- c("a", "b", NA, NA)
village_residence <- c("a", "c", "a", "d")
coalesce(village_detection, village_residence)
## [1] "a" "b" "a" "d"
If you provide dataframe columns, for each row it will fill the value with the first non-missing value in the columns you provided.
linelist <- linelist %>%
mutate(village = coalesce(village_detection, village_residence))
For more complicated row-wise calculations, see the section on Row-wise calculations.
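The coalesce() call on dataframe columns can be tried on a small invented dataset. As an addition of ours, a final fallback value ("unknown") is supplied for rows where both columns are missing.

```r
library(dplyr)

# Toy data: detection village is sometimes missing
df <- tibble(
  village_detection = c("a", "b", NA, NA),
  village_residence = c("a", "c", "a", NA)
)

df <- df %>%
  mutate(
    # first non-missing value per row, falling back to "unknown"
    village = coalesce(village_detection, village_residence, "unknown")
  )

df$village
```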
If you want a column to reflect the cumulative sum/mean/min/max etc. as assessed down the rows of a dataframe, use the following functions:
cumsum() returns the cumulative sum, as shown below:
sum(c(2,4,15,10)) # returns only one number
## [1] 31
cumsum(c(2,4,15,10)) # returns the cumulative sum at each step
## [1] 2 6 21 31
This can be used in a dataframe when making a new column. For example, to calculate the cumulative number of cases per day in an outbreak, consider code like this:
cumulative_case_counts <- linelist %>%
count(date_onset) %>% # count of rows per day
mutate(
cumulative_cases = cumsum(n) # new column of the cumulative sum at that row
)
Below are the first 10 rows:
head(cumulative_case_counts, 10)
## date_onset n cumulative_cases
## 1 2012-04-26 1 1
## 2 2012-06-15 1 2
## 3 2012-06-16 1 3
## 4 2012-06-23 1 4
## 5 2012-06-24 1 5
## 6 2012-07-04 1 6
## 7 2012-07-08 1 7
## 8 2012-07-14 1 8
## 9 2012-07-15 1 9
## 10 2012-07-17 1 10
See the page on Epidemic curves for how to plot cumulative incidence with the epicurve.
See also:
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
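For a side-by-side look at these functions, here is a sketch applied to the same four numbers used above. cumsum(), cummin() and cummax() are base R; cummean(), cumany() and cumall() come from dplyr (version 1.0 or later is assumed).

```r
library(dplyr)

x <- c(2, 4, 15, 10)

running_sum  <- cumsum(x)    # 2 6 21 31
running_mean <- cummean(x)   # 2 3 7 7.75
running_min  <- cummin(x)    # 2 2 2 2
running_max  <- cummax(x)    # 2 4 15 15

# Logical versions: has ANY value so far exceeded 10? Have ALL been below 10?
any_above_10 <- cumany(x > 10)   # FALSE FALSE TRUE TRUE
all_below_10 <- cumall(x < 10)   # TRUE TRUE FALSE FALSE
```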
To define a new column (or re-define a column) using base R, just use the assignment operator as below.
Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:
linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age
Below, a new column is added to the pipe chain and some classes are converted.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_header, x28)) %>%
# de-duplicate
distinct() %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# add new column
mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
# convert class of columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age))
Here are a few scenarios where you need to re-code (change) values:
To change values manually you can use the recode() function within the mutate() function.
Imagine there is a nonsensical date in the data (e.g. “2014-14-15”): you could fix the date in the source data, or, you could write the change into the cleaning pipeline via mutate() and recode().
# fix incorrect values # old value # new value
linelist <- linelist %>%
mutate(date_onset = recode(date_onset, "2014-14-15" = "2014-04-15"))
The mutate() line above can be read as: “mutate the column date_onset to equal the column date_onset re-coded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this.
Here is another example re-coding multiple values within one column.
In linelist the values in the column “hospital” must be cleaned. There are several different spellings and many missing values.
table(linelist$hospital, useNA = "always")
##
## Central Hopital Central Hospital Hospital A Hospital B
## 11 443 289 289
## Military Hopital Military Hospital Mitylira Hopital Mitylira Hospital
## 30 786 1 79
## Other Port Hopital Port Hospital St. Mark's Maternity Hospital (SMMH)
## 885 47 1725 411
## St. Marks Maternity Hopital (SMMH) <NA>
## 11 1472
The recode() command below re-defines the column “hospital” as the current column “hospital”, but with the specified recode changes. Don’t forget commas after each!
linelist <- linelist %>%
mutate(hospital = recode(hospital,
# reference: OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Port Hopital" = "Port Hospital",
"Central Hopital" = "Central Hospital",
"other" = "Other",
"St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
))
Now we see the spellings in the hospital column have been corrected and consolidated:
table(linelist$hospital, useNA = "always")
##
## Central Hospital Hospital A Hospital B Military Hospital
## 454 289 289 896
## Other Port Hospital St. Mark's Maternity Hospital (SMMH) <NA>
## 885 1772 422 1472
TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space in between ("").
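A short sketch of the difference between a blank string and NA, on invented values. The last step converts the blanks to true missing values with dplyr's na_if().

```r
library(dplyr)

# Toy data: one blank string and one true NA
df <- tibble(hospital = c("Port Hospital", "", NA, "Other"))

# The blank is not counted as missing
n_missing_before <- sum(is.na(df$hospital))

# Count blanks by referencing ""
n_blank <- sum(df$hospital == "", na.rm = TRUE)

# Convert blanks to NA
df_clean <- df %>%
  mutate(hospital = na_if(hospital, ""))

n_missing_after <- sum(is.na(df_clean$hospital))
```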
The tidyverse offers two special functions for handling missing values: replace_na() (from tidyr) and na_if() (from dplyr).
replace_na()
To change missing values (NA) to a specific value, such as “Missing”, use the function replace_na() within mutate(). Note that this is used in the same manner as recode above - the name of the variable must be repeated within replace_na().
linelist <- linelist %>%
mutate(hospital = replace_na(hospital, "Missing"))
na_if()
To convert a specific value to NA, use na_if(). The command below performs the opposite operation of replace_na(). In the example below, any values of “Missing” in the column hospital are converted to NA.
linelist <- linelist %>%
mutate(hospital = na_if(hospital, "Missing"))
Note: na_if() cannot be used for logic criteria (e.g. “all values > 99”) - use replace() or case_when() for this:
# Convert temperatures above 40 to NA
linelist <- linelist %>%
mutate(temp = replace(temp, temp > 40, NA))
# Convert onset dates earlier than 2000 to missing
linelist <- linelist %>%
mutate(date_onset = replace(date_onset, date_onset < as.Date("2000-01-01"), NA))
Below we demonstrate how to re-code values in a column using logic and conditions:
replace(), ifelse() and if_else() for simple logic
case_when() for more complex logic
replace()
To re-code with simple logical criteria, you can use replace() within mutate(). replace() is a function from base R. Use a logic condition to specify the rows to change. The general syntax is:
mutate(col_to_change = replace(col_to_change, criteria for rows, new value)).
One common situation is changing one value in one row, using a unique row identifier. Below, the gender is changed to “Female” in the row where the column case_id is “2195”.
# Example: change gender of one specific observation to "Female"
linelist <- linelist %>%
mutate(gender = replace(gender, case_id == "2195", "Female"))
The equivalent command using base R syntax and the indexing brackets [ ] is below. It reads as “Change the value of the dataframe linelist’s column gender (for the rows where linelist’s column case_id has the value ‘2195’) to ‘Female’”.
linelist$gender[linelist$case_id == "2195"] <- "Female"
ifelse() and if_else()
Another tool for simple logical re-coding is ifelse() and its partner if_else(). However, in most cases it is better to use case_when() (for clarity).
These commands are simplified versions of an if and else programming statement. The general syntax is:
ifelse(condition, value to return if condition evaluates to TRUE, value to return if condition evaluates to FALSE)
Below, the column source_known is defined (or re-defined). Its value in a given row is set to “known” if the row’s value in column source is not missing. If the value in source is missing, then the value in source_known is set to “unknown”.
linelist <- linelist %>%
mutate(source_known = ifelse(!is.na(source), "known", "unknown"))
if_else() is a special version from dplyr that handles dates. Note that if the ‘true’ value is a date, the ‘false’ value must also be a date, hence using as.Date(NA) instead of just NA.
# Create a date of death column, which is NA if patient has not died.
linelist <- linelist %>%
mutate(date_death = if_else(outcome == "Death", date_outcome, as.Date(NA)))
Avoid stringing together many ifelse commands… use case_when() instead! case_when() is much easier to read and you’ll make fewer errors.
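To illustrate why nested ifelse() calls become hard to read, here is a minimal sketch on a made-up vector (the object names ages, nested and flat are hypothetical, not part of the linelist). Both versions give the same result, but the case_when() version lists one condition per line.

```r
library(dplyr)

# Hypothetical ages, including one missing value
ages <- c(2, 17, 45, NA)

# Nested ifelse(): each extra category adds another level of nesting
nested <- ifelse(is.na(ages), "Missing",
                 ifelse(ages < 5, "Under 5",
                        ifelse(ages < 18, "5-17", "18+")))

# case_when(): conditions are checked from top to bottom
flat <- case_when(
  is.na(ages) ~ "Missing",
  ages < 5    ~ "Under 5",
  ages < 18   ~ "5-17",
  TRUE        ~ "18+")

# Both approaches return the same character vector
identical(nested, flat)
```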
Outside of the context of a dataframe, if you want to have an object used in your code switch its value based on criteria, consider using switch() from base R. See the section on using switch() in the page on R interactive console.
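As a minimal sketch of switch() outside of a dataframe, the example below picks a column name based on a setting. The object report_level and the column names are hypothetical and only serve as an illustration.

```r
# Hypothetical setting chosen by the analyst
report_level <- "district"

# switch() returns the value paired with the matching name;
# the final unnamed value is the fallback when nothing matches
group_col <- switch(report_level,
                    "national" = "country",
                    "district" = "district_name",
                    "facility" = "hospital",
                    "unknown")

group_col
```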
Use dplyr’s case_when() if you need to use complex logic statements to re-code values. There are important differences from recode() in syntax and logic order!
case_when() commands have a Right-Hand Side (RHS) and a Left-Hand Side (LHS) separated by a “tilde” ~. The logic criteria are in the LHS and the pursuant value is on the RHS. Statements are separated by commas. It is important to note that:
- You can include a final statement with TRUE on the LHS, which signifies any row value that did not meet any of the previous criteria
- All RHS values must be of the same class. If assigning NA, you may need to use special values such as NA_character_, NA_real_ (for numeric or POSIX), and as.Date(NA)
Below we utilize the columns age and age_unit to create a column age_years:
linelist <- linelist %>%
mutate(age_years = case_when(
age_unit == "years" ~ age, # if age is given in years
age_unit == "months" ~ age/12, # if age is given in months
is.na(age_unit) ~ age, # if age unit is missing, assume years
TRUE ~ NA_real_)) # any other circumstance assign missing
Use the package linelist to clean a linelist with a cleaning dictionary.
Import the cleaning dictionary:
cleaning_dict <- import("cleaning_dict.csv")
Columns to protect from cleaning are passed to clean_data() as a numeric or logical vector, so you will see use of names(.) in the command below (the dot means the dataframe).
protected_cols <- c("case_id", "source")
Run clean_data(), specifying the cleaning dictionary:
linelist <- linelist %>%
linelist::clean_data(
wordlists = cleaning_dict,
spelling_vars = "col", # dict column containing column names, defaults to 3rd column in dict
protect = names(.) %in% protected_cols
)
Scroll to see how values have changed - particularly gender (lowercase to uppercase), and all the symptoms columns have been transformed from yes/no to 1/0.
CAUTION: clean_data() from linelist package will also clean values in your data unless those columns are protected - you may encounter changes to columns with dashes “-” or .
Note that your column names in the cleaning dictionary must correspond to the names at this point in your cleaning script. clean_data() itself also implements a column name cleaning function similar to clean_names() from janitor that standardizes column names prior to applying the dictionary.
See this online reference for the linelist package for more details.
Below, some new columns and column transformations are added to the pipe chain.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_header, x28)) %>%
# de-duplicate
distinct() %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# add column
mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
# convert class of columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# add column: delay to hospitalisation
mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>%
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Port Hopital" = "Port Hospital",
"Central Hopital" = "Central Hospital",
"other" = "Other",
"St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_))
Here we describe some special approaches for creating numeric categories. Common examples include age categories, groups of lab values, etc. Here we will discuss:
- age_categories(), from the epikit package
- cut(), from base R
- case_when()
For this example we will create an age_cat column using the age_years column.
# check the class of the linelist variable age_years
class(linelist$age_years)
## [1] "numeric"
First, examine the distribution of your data, to make appropriate cut-points. See the page on how to Plot continuous data.
# examine the distribution
hist(linelist$age_years)
summary(linelist$age_years, na.rm=T)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000000000 6.0000000000 13.0000000000 16.0212059408 23.0000000000 77.0000000000 105
CAUTION: Sometimes, numeric variables will import as class “character”. This occurs if there are non-numeric characters in some of the values, for example an entry of “2 months” for age, or (depending on your R locale settings) if a comma is used in the decimals place (e.g. “4,5” to mean four and one half years).
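The situation described in the caution above can be checked with a few lines of base R. This is only a sketch on a made-up vector (age_raw is hypothetical); how to handle entries such as “2 months” depends on your data and should be decided case by case.

```r
# Hypothetical ages as they might arrive after import
age_raw <- c("12", "4,5", "2 months", "30", NA)
class(age_raw)   # "character", so it cannot be cut into numeric categories yet

# Flag non-missing values that are not plain numbers
not_numeric <- !is.na(age_raw) & !grepl("^[0-9]+(\\.[0-9]+)?$", age_raw)
age_raw[not_numeric]

# Replace a decimal comma with a point, then convert to numeric.
# Values that still cannot be converted (e.g. "2 months") become NA.
age_num <- suppressWarnings(as.numeric(sub(",", ".", age_raw, fixed = TRUE)))
age_num
```

Reviewing the flagged values before converting helps avoid silently turning valid but oddly formatted entries into NA.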
age_categories()
With the epikit package, you can use the age_categories() function to easily categorize and label numeric columns (note: this function can be applied to non-age numeric variables too). Of note: the output is an ordered factor.
Here are the required inputs:
- breakers = - a numeric vector of break points for the new groups
First, the simplest example:
# Simple example
################
pacman::p_load(epikit)
linelist <- linelist %>%
mutate(
age_cat = age_categories(
age_years,
breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70)))
# show table
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-69 70+ <NA>
## 1192 1177 1010 923 1167 525 268 78 27 7 105
The break values you specify are by default included in the “higher” group: each group includes its lower break value and excludes the next break value. As shown below, you can add 1 to each break value to achieve groups that include the original upper values (e.g. 0-5, 6-10).
# Include upper ends for the same categories
############################################
linelist <- linelist %>%
mutate(
age_cat = age_categories(
age_years,
breakers = c(0, 6, 11, 16, 21, 31, 41, 51, 61, 71)))
# show table
table(linelist$age_cat, useNA = "always")
##
## 0-5 6-10 11-15 16-20 21-30 31-40 41-50 51-60 61-70 71+ <NA>
## 1420 1165 1001 873 1097 481 234 77 20 6 105
You can adjust how the labels are displayed with separator =. The default is “-”.
You can adjust the upper cut-off of values allowed to be included in a group. Use ceiling =, the default is FALSE. If TRUE, the highest break value is a “ceiling” and a category “XX+” is not included. Any values above the highest break value or upper (if defined) are categorized as NA. Below is an example with ceiling = TRUE, so that there is no category of XX+ and values above 70 (the highest break value) are assigned as NA.
# With ceiling set to TRUE
##########################
linelist <- linelist %>%
mutate(
age_cat = age_categories(
age_years,
breakers = c(0, 5, 10, 15, 20, 30, 40, 50, 60, 70),
ceiling = TRUE)) # 70 is ceiling, all above become NA
# show table
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-39 40-49 50-59 60-70 <NA>
## 1192 1177 1010 923 1167 525 268 78 28 111
Alternatively, instead of breakers =, you can provide all of lower =, upper =, and by =:
- lower = The lowest number you want considered - default is 0
- upper = The highest number you want considered
- by = The number of years between groups
linelist <- linelist %>%
mutate(
age_cat = age_categories(
age_years,
lower = 0,
upper = 100,
by = 10))
# show table
table(linelist$age_cat, useNA = "always")
##
## 0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89 90-99 100+ <NA>
## 2369 1933 1167 525 268 78 27 7 0 0 0 105
See the function’s Help page for more details (enter ?age_categories in the R console).
cut()
You can also use the base R function cut(), which creates categories from a numeric column. The differences from age_categories() are:
- The default behavior of cut() is that lower break values are excluded from each category, and upper break values are included. This is the opposite behavior from the age_categories() function.
- The lowest break value is only included in the lowest category if you specify include.lowest = TRUE
- Custom labels are provided with the labels = argument
The basic syntax within cut() is to first provide the numeric variable to be cut (age_years), and then the breaks argument, which is a numeric vector (c()) of break points. Using cut(), the resulting column is a factor whose levels follow the order of the break points. If used within mutate() (a dplyr verb) it is not necessary to specify the dataframe before the column name (e.g. linelist$age_years).
Create new column of age categories (age_cat) by cutting the numeric age_years column at specified break points.
Below is a detailed description of the behavior of using cut() to make the age_cat column.
A simple example of cut() applied to age_years to make the new variable age_cat is below:
# Create new variable, by cutting the numeric age variable
# by default, the lower break is excluded and the upper break is included in each category
linelist <- linelist %>%
mutate(
age_cat = cut(
age_years,
breaks = c(0, 5, 10, 15, 20,
30, 50, 70, 100),
include.lowest = TRUE # include 0 in lowest group
))
# tabulate the number of observations per group
table(linelist$age_cat, useNA = "always")
##
## [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
## 1420 1165 1001 873 1097 715 97 6 105
By default, the categorization includes the upper break value of each group and excludes the lower break value. The default labels use the notation “(A, B]”, which means the group does not include A (the lower break value), but includes B (the upper break value). Reverse this behavior by providing the right = FALSE argument.
Thus, by default “0” values are excluded from the lowest group, and categorized as NA. “0” values could be infants coded as age 0. To change this add the argument include.lowest = TRUE. Then, any “0” values are included in the lowest group. The automatically-generated label for the lowest category will change from “(0,B]” to “[0,B]”, which signifies that 0 values are included.
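The default behavior and the effect of include.lowest = TRUE can be seen on a small made-up vector (ages is hypothetical and chosen to sit on the break points):

```r
# Hypothetical ages that fall exactly on break points
ages <- c(0, 5, 6, 10)

# Default: lower break excluded, upper break included, so 0 becomes NA
default_cut <- cut(ages, breaks = c(0, 5, 10))
default_cut

# include.lowest = TRUE brings 0 into the lowest group, now labeled [0,5]
with_lowest <- cut(ages, breaks = c(0, 5, 10), include.lowest = TRUE)
with_lowest
```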
Check your work!!! Verify that each age value was assigned to the correct category by cross-tabulating the numeric and category columns. Examine assignment of boundary values (e.g. 15, if neighboring categories are 10-15 and 15-20).
# Cross tabulation of the numeric and category columns.
table("Numeric Values" = linelist$age_years, # names specified in table for clarity.
"Categories" = linelist$age_cat,
useNA = "always") # don't forget to examine NA values
## Categories
## Numeric Values [0,5] (5,10] (10,15] (15,20] (20,30] (30,50] (50,70] (70,100] <NA>
## 0 143 0 0 0 0 0 0 0 0
## 0.0833333333333333 1 0 0 0 0 0 0 0 0
## 0.166666666666667 2 0 0 0 0 0 0 0 0
## 0.25 2 0 0 0 0 0 0 0 0
## 0.333333333333333 3 0 0 0 0 0 0 0 0
## 0.416666666666667 4 0 0 0 0 0 0 0 0
## 0.5 3 0 0 0 0 0 0 0 0
## 0.75 3 0 0 0 0 0 0 0 0
## 0.833333333333333 2 0 0 0 0 0 0 0 0
## 0.916666666666667 4 0 0 0 0 0 0 0 0
## 1 241 0 0 0 0 0 0 0 0
## 1.5 1 0 0 0 0 0 0 0 0
## 2 261 0 0 0 0 0 0 0 0
## 3 272 0 0 0 0 0 0 0 0
## 4 250 0 0 0 0 0 0 0 0
## 5 228 0 0 0 0 0 0 0 0
## 6 0 223 0 0 0 0 0 0 0
## 7 0 241 0 0 0 0 0 0 0
## 8 0 252 0 0 0 0 0 0 0
## 9 0 233 0 0 0 0 0 0 0
## 10 0 216 0 0 0 0 0 0 0
## 11 0 0 250 0 0 0 0 0 0
## 12 0 0 188 0 0 0 0 0 0
## 13 0 0 186 0 0 0 0 0 0
## 14 0 0 170 0 0 0 0 0 0
## 15 0 0 207 0 0 0 0 0 0
## 16 0 0 0 171 0 0 0 0 0
## 17 0 0 0 212 0 0 0 0 0
## 18 0 0 0 150 0 0 0 0 0
## 19 0 0 0 183 0 0 0 0 0
## 20 0 0 0 157 0 0 0 0 0
## 21 0 0 0 0 123 0 0 0 0
## 22 0 0 0 0 154 0 0 0 0
## 23 0 0 0 0 128 0 0 0 0
## 24 0 0 0 0 134 0 0 0 0
## 25 0 0 0 0 117 0 0 0 0
## 26 0 0 0 0 95 0 0 0 0
## 27 0 0 0 0 99 0 0 0 0
## 28 0 0 0 0 79 0 0 0 0
## 29 0 0 0 0 81 0 0 0 0
## 30 0 0 0 0 87 0 0 0 0
## 31 0 0 0 0 0 69 0 0 0
## 32 0 0 0 0 0 71 0 0 0
## 33 0 0 0 0 0 53 0 0 0
## 34 0 0 0 0 0 43 0 0 0
## 35 0 0 0 0 0 41 0 0 0
## 36 0 0 0 0 0 46 0 0 0
## 37 0 0 0 0 0 39 0 0 0
## 38 0 0 0 0 0 46 0 0 0
## 39 0 0 0 0 0 30 0 0 0
## 40 0 0 0 0 0 43 0 0 0
## 41 0 0 0 0 0 37 0 0 0
## 42 0 0 0 0 0 35 0 0 0
## 43 0 0 0 0 0 30 0 0 0
## 44 0 0 0 0 0 16 0 0 0
## 45 0 0 0 0 0 24 0 0 0
## 46 0 0 0 0 0 27 0 0 0
## 47 0 0 0 0 0 17 0 0 0
## 48 0 0 0 0 0 19 0 0 0
## 49 0 0 0 0 0 20 0 0 0
## 50 0 0 0 0 0 9 0 0 0
## 51 0 0 0 0 0 0 17 0 0
## 52 0 0 0 0 0 0 13 0 0
## 53 0 0 0 0 0 0 9 0 0
## 54 0 0 0 0 0 0 10 0 0
## 55 0 0 0 0 0 0 3 0 0
## 56 0 0 0 0 0 0 8 0 0
## 57 0 0 0 0 0 0 4 0 0
## 58 0 0 0 0 0 0 2 0 0
## 59 0 0 0 0 0 0 3 0 0
## 60 0 0 0 0 0 0 8 0 0
## 61 0 0 0 0 0 0 2 0 0
## 62 0 0 0 0 0 0 2 0 0
## 63 0 0 0 0 0 0 2 0 0
## 64 0 0 0 0 0 0 5 0 0
## 65 0 0 0 0 0 0 2 0 0
## 66 0 0 0 0 0 0 2 0 0
## 68 0 0 0 0 0 0 1 0 0
## 69 0 0 0 0 0 0 3 0 0
## 70 0 0 0 0 0 0 1 0 0
## 71 0 0 0 0 0 0 0 2 0
## 73 0 0 0 0 0 0 0 1 0
## 74 0 0 0 0 0 0 0 1 0
## 75 0 0 0 0 0 0 0 1 0
## 77 0 0 0 0 0 0 0 1 0
## <NA> 0 0 0 0 0 0 0 0 105
Reverse break inclusion behavior in cut()
Lower break values will be included in each category (and upper break values excluded) if the argument right = is included and set to FALSE. This is applied below - note how the values have shifted among the categories.
NOTE: If you include the include.lowest = TRUE argument and right = FALSE, the extreme inclusion will now apply to the highest break point value and category, not the lowest.
linelist <- linelist %>%
mutate(
age_cat = cut(
age_years,
breaks = c(0, 5, 10, 15, 20,
30, 50, 70, 100), # same breaks as above
right = FALSE, # include each *lower* break point
include.lowest = TRUE # include *highest* value in *highest* group
))
table(linelist$age_cat, useNA = "always")
##
## [0,5) [5,10) [10,15) [15,20) [20,30) [30,50) [50,70) [70,100] <NA>
## 1192 1177 1010 923 1167 793 105 7 105
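A small sketch with made-up values (ages is hypothetical) shows how right = FALSE moves boundary values into the higher group, and why include.lowest = TRUE is then needed to keep the highest break value:

```r
# Hypothetical ages that fall exactly on break points
ages <- c(0, 5, 10, 100)

# right = FALSE: each group includes its lower break and excludes its upper break,
# so the value 100 (the highest break) is left uncategorized
left_closed <- cut(ages, breaks = c(0, 5, 10, 100), right = FALSE)
left_closed

# Adding include.lowest = TRUE now includes the *highest* break value
left_closed_all <- cut(ages, breaks = c(0, 5, 10, 100),
                       right = FALSE, include.lowest = TRUE)
left_closed_all
```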
Add labels
As these are manually written, be very careful to ensure they are accurate! Check your work using cross-tabulation, as described above. Below is the same code as above, with manual labels added.
linelist <- linelist %>%
mutate(
age_cat = cut(
age_years,
breaks = c(0, 5, 10, 15, 20,
30, 50, 70, 100), # same breaks as above
right = FALSE, # include each *lower* break point
include.lowest = TRUE, # include *highest* value in *highest* group
labels = c("0-4", "5-9", "10-14",
"15-19", "20-29", "30-49",
"50-69", "70-100")
))
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70-100 <NA>
## 1192 1177 1010 923 1167 793 105 7 105
Re-labeling NA values with cut()
Because cut() does not automatically label NA values, you may want to assign a label such as “Missing”. This requires a few extra steps because cut() automatically classified the new column age_cat as class Factor (a rigid class limited to the defined values).
First, convert age_cat from Factor to Character class, so you have flexibility to add new character values (e.g. “Missing”). Otherwise you will encounter an error. Then, use the tidyr function replace_na() to replace NA values with a character value like “Missing”. These steps can be combined into one step, as shown below.
Note that Missing has been added, but the order of the categories is now wrong (alphabetical considering numbers as characters).
linelist <- linelist %>%
# cut() creates age_cat, automatically of class Factor
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),
right = FALSE,
include.lowest = TRUE,
labels = c("0-4", "5-9", "10-14", "15-19",
"20-29", "30-49", "50-69", "70-100")),
# convert to class Character, and replace NA with "Missing"
age_cat = replace_na(as.character(age_cat), "Missing"))
table(linelist$age_cat, useNA = "always")
##
## 0-4 10-14 15-19 20-29 30-49 5-9 50-69 70-100 Missing <NA>
## 1192 1010 923 1167 793 1177 105 7 105 0
To fix this, re-convert age_cat to a factor, and define the order of the levels correctly.
linelist <- linelist %>%
# cut() creates age_cat, automatically of class Factor
mutate(age_cat = cut(age_years,
breaks = c(0, 5, 10, 15, 20, 30, 50, 70, 100),
right = FALSE,
include.lowest = TRUE,
labels = c("0-4", "5-9", "10-14", "15-19",
"20-29", "30-49", "50-69", "70-100")),
# convert to class Character, and replace NA with "Missing"
age_cat = replace_na(as.character(age_cat), "Missing"),
# re-classify age_cat as Factor, with correct level order and new "Missing" level
age_cat = factor(age_cat, levels = c("0-4", "5-9", "10-14", "15-19", "20-29",
"30-49", "50-69", "70-100", "Missing")))
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70-100 Missing <NA>
## 1192 1177 1010 923 1167 793 105 7 105 0
If the above seems cumbersome, consider using age_categories() instead, as described before.
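Another option, shown here only as a sketch in base R on a made-up vector, is to add “Missing” as a new factor level and then assign it to the NA values. This is a different technique from the character conversion above; it keeps the existing level order, so no re-ordering step is needed.

```r
# Hypothetical ages, cut into labeled groups (a factor with NA for the missing age)
age_cat <- cut(c(2, 7, NA, 12),
               breaks = c(0, 5, 10, 15),
               right  = FALSE,
               labels = c("0-4", "5-9", "10-14"))

# Add "Missing" as the last level, then assign it to the NA values
levels(age_cat) <- c(levels(age_cat), "Missing")
age_cat[is.na(age_cat)] <- "Missing"

levels(age_cat)
table(age_cat, useNA = "always")
```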
Make breaks and labels
For a fast way to make breaks and labels manually, use something like below. See the R Basics page for references on seq() and rep().
# Make break points from 0 to 90 by 5
age_seq <- seq(from = 0, to = 90, by = 5)
age_seq
# Make labels for the above categories, assuming default cut() settings
# (each category excludes its lower break and includes its upper break)
age_labels <- paste0(head(age_seq, -1) + 1, "-", tail(age_seq, -1))
age_labels
# check that there is one fewer label than break points, as cut() requires
length(age_labels) == length(age_seq) - 1
Read more about cut() in its Help page by entering ?cut in the R console.
Make breaks from quantile(). This is from the stats package which comes in base R.
age_quantiles <- quantile(linelist$age_years, c(0, .25, .50, .75, .90, .95), na.rm=T)
age_quantiles
## 0% 25% 50% 75% 90% 95%
## 0 6 13 23 33 41
# to return only the numbers use unname()
age_quantiles <- unname(age_quantiles)
age_quantiles
## [1] 0 6 13 23 33 41
You can then use these as break points in age_categories() or cut().
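As a sketch of that last step, the example below passes quantile-based breaks to cut(). The vector ages is made up, and the choice to include the 100% quantile is an assumption of this example: without the maximum as the final break, values above the highest break would become NA.

```r
# Hypothetical ages, including one missing value
ages <- c(0, 1, 3, 6, 8, 13, 15, 23, 30, 41, 77, NA)

# Quartile break points, including the minimum (0%) and maximum (100%)
age_breaks <- unname(quantile(ages, c(0, .25, .50, .75, 1), na.rm = TRUE))

# unique() guards against repeated break values, which cut() does not accept
age_breaks <- unique(age_breaks)

# include.lowest = TRUE keeps the minimum value in the first group
age_q <- cut(ages, breaks = age_breaks, include.lowest = TRUE)
table(age_q, useNA = "always")
```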
case_when()
The dplyr function case_when() can also be used to create numeric categories.
- It allows explicit labeling of NA values in one step
If using case_when() please review the proper use as described earlier in this page, as logic and order of assignment are important to understand to avoid errors.
CAUTION: In case_when() all right-hand side values must be of the same class. Thus, if your categories are character values (e.g. “20-30 years”) then any designated outcome for NA age values must also be character (either “Missing”, or the special NA_character_ instead of NA).
You will need to designate the column as a factor (by wrapping case_when() in the function factor()) and provide the ordering of the factor levels using the levels = argument after the close of the case_when() function. When using cut(), the factor and ordering of levels is done automatically.
linelist <- linelist %>%
mutate(
age_cat = factor(case_when(
# provide the case_when logic and outcomes
age_years >= 0 & age_years < 5 ~ "0-4",
age_years >= 5 & age_years < 10 ~ "5-9",
age_years >= 10 & age_years < 15 ~ "10-14",
age_years >= 15 & age_years < 20 ~ "15-19",
age_years >= 20 & age_years < 30 ~ "20-29",
age_years >= 30 & age_years < 50 ~ "30-49",
age_years >= 50 & age_years < 70 ~ "50-69",
age_years >= 70 & age_years <= 100 ~ "70-100",
is.na(age_years) ~ "Missing", # if age_years is missing
TRUE ~ "Check value"), # trigger for review
# define the levels order for factor()
levels = c("0-4","5-9", "10-14",
"15-19", "20-29", "30-49",
"50-69", "70-100", "Missing", "Check value")))
And now view the results with a table of the new column:
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70-100 Missing Check value <NA>
## 1192 1177 1010 923 1167 793 105 7 105 0 0
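Because case_when() assigns the value of the first condition that is TRUE, the order of the statements matters. A minimal sketch on a made-up vector (age_years here is a stand-alone hypothetical vector, not the linelist column):

```r
library(dplyr)

# Hypothetical ages, including one missing value
age_years <- c(3, 12, 72, NA)

# Broad condition first: every non-missing age matches it, so "70+" is never assigned
wrong_order <- case_when(
  age_years >= 0  ~ "0+",
  age_years >= 70 ~ "70+",
  TRUE            ~ "Missing")

# Narrow condition first: the value 72 is now caught by the "70+" statement
right_order <- case_when(
  age_years >= 70 ~ "70+",
  age_years >= 0  ~ "0+",
  TRUE            ~ "Missing")

wrong_order
right_order
```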
Below, code to create two categorical age columns is added to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_header, x28)) %>%
# de-duplicate
distinct() %>%
# add column
mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
# convert class of columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# add column: delay to hospitalisation
mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>%
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Port Hopital" = "Port Hospital",
"Central Hopital" = "Central Hospital",
"other" = "Other",
"St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_)) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
mutate(
# age categories: custom
age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
# age categories: 0 to 85 by 5s
age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5)))
Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.
linelist <- linelist %>%
add_row(row_num = 666,
case_id = "abc",
generation = 4,
`infection date` = as.Date("2020-10-10"),
.before = 2)
Use .before and .after to place the row you want to add. .before = 3 will put the new row before the 3rd row. The default behavior is to add the row to the end. Columns not specified will be left empty.
The new row number may look strange (“…23”) but the row numbers in the pre-existing rows have changed. So if using the command twice, examine/test the insertion carefully.
If a class you provide is off you will see an error like this:
Error: Can't combine ..1$infection date <date> and ..2$infection date <character>.
(when inserting a row with a date value, remember to wrap the date in the function as.Date() like as.Date("2020-10-10")).
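A minimal, self-contained sketch of add_row() on a small made-up dataframe (small_linelist and its values are hypothetical) shows the placement with .before and how unspecified columns are filled with NA:

```r
library(tibble)

# Hypothetical mini linelist with a character, numeric, and Date column
small_linelist <- tibble(
  case_id        = c("a1", "b2", "c3"),
  generation     = c(1, 2, 2),
  date_infection = as.Date(c("2020-10-01", "2020-10-03", "2020-10-05")),
  hospital       = c("Port Hospital", "Other", "Port Hospital"))

# Insert a new row before the current 2nd row; the date is wrapped in as.Date()
small_linelist <- add_row(small_linelist,
                          case_id        = "abc",
                          generation     = 4,
                          date_infection = as.Date("2020-10-10"),
                          .before        = 2)

small_linelist   # hospital was not specified for the new row, so it is NA
```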
A typical early cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.
Below is shown how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R.
filter()
This simple example re-defines the dataframe linelist as itself, having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept.
In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing (NA). Thus, rows where case_id is not missing are kept.
Before the filter is applied, the number of rows in linelist is 6479.
linelist <- linelist %>%
filter(!is.na(case_id)) # keep only rows where case_id is not missing
After the filter is applied, the number of rows in linelist is 6475.
filter()
A more complex example using filter():
Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.
hist(linelist$date_onset, breaks = 50)
Can we just filter by date_onset to rows after June 2013? Caution! Applying the code filter(date_onset > as.Date("2013-06-01")) would remove any rows in the later epidemic with a missing date of onset!
DANGER: Filtering to greater than (>) or less than (<) a date or number can remove any rows with missing values (NA)! This is because a comparison with NA returns NA rather than TRUE, and filter() keeps only the rows where the condition is TRUE.
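The effect of missing values on a date filter can be seen in a small sketch. The dataframe onset_demo and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical cases: one early onset, one later onset, one missing onset
onset_demo <- data.frame(
  case_id    = c("a", "b", "c"),
  date_onset = as.Date(c("2012-05-01", "2014-03-10", NA)))

# The comparison itself returns NA for the missing date
onset_demo$date_onset > as.Date("2013-06-01")

# A plain date filter therefore drops the row with missing onset
kept_naive <- onset_demo %>%
  filter(date_onset > as.Date("2013-06-01"))

# Explicitly keeping missing dates retains that row
kept_with_na <- onset_demo %>%
  filter(date_onset > as.Date("2013-06-01") | is.na(date_onset))

nrow(kept_naive)     # 1
nrow(kept_with_na)   # 2
```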
Examine a cross-tabulation to make sure we exclude only the correct rows:
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2012 2013 2014 2015 <NA>
## Central Hospital 0 0 333 95 26
## Hospital A 226 48 0 0 15
## Hospital B 233 39 0 0 16
## Military Hospital 0 0 666 197 33
## Missing 0 0 1099 309 61
## Other 0 0 676 172 37
## Port Hospital 7 3 1361 337 64
## St. Mark's Maternity Hospital (SMMH) 0 0 311 91 20
## <NA> 0 0 0 0 0
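What other criteria can we filter on to remove the first outbreak (in 2012 & 2013) from the dataset? We see that:
- The first outbreak (2012 & 2013) occurred at Hospital A, Hospital B, and in 10 cases at Port Hospital
- Hospitals A & B had no cases with onset in 2014 or 2015, but Port Hospital did
We want to exclude:
- The rows with onset in 2012 & 2013 at Hospital A, Hospital B, or Port Hospital
- The rows with missing onset dates at Hospitals A & B
We start with a linelist of 6475 rows. Here is our filter statement: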
linelist <- linelist %>%
# keep rows where onset is after 1 June 2013 OR where onset is missing and it was a hospital OTHER than Hospital A or B
filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))
nrow(linelist)
## [1] 5888
When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, and the 10 Port Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2014 2015 <NA>
## Central Hospital 333 95 26
## Military Hospital 666 197 33
## Missing 1099 309 61
## Other 676 172 37
## Port Hospital 1361 337 64
## St. Mark's Maternity Hospital (SMMH) 311 91 20
## <NA> 0 0 0
Multiple statements can be included within one filter command (separated by commas), or you can always pipe to a separate filter() command for clarity.
Note: some readers may notice that it would be easier to just filter by date_hospitalisation because it is 100% complete with no missing values. This is true. But date_onset is used for purposes of demonstrating a complex filter.
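As a small sketch of the point about multiple statements, conditions separated by commas within filter() must all be TRUE for a row to be kept, which is equivalent to piping through separate filter() commands. The dataframe cases is made up for illustration.

```r
library(dplyr)

# Hypothetical cases with one missing identifier
cases <- data.frame(
  case_id = c("a", NA, "c", "d"),
  age     = c(10, 20, 30, 40))

# Two conditions in one filter() command, separated by a comma
one_call <- cases %>%
  filter(!is.na(case_id), age > 15)

# The same conditions in two piped filter() commands
two_calls <- cases %>%
  filter(!is.na(case_id)) %>%
  filter(age > 15)

all.equal(one_call, two_calls)
```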
Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.
# dataframe <- filter(dataframe, condition(s) for rows to keep)
linelist <- filter(linelist, !is.na(case_id))
You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.
# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)
linelist <- linelist[!is.na(linelist$case_id), ]
TIP: Use bracket-subset syntax with View() to quickly review a few records.
This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:
View the linelist in the viewer panel:
View(linelist)
View specific data for three cases:
View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])
Note: the above command can also be written with dplyr verbs filter() and select() as below:
View(linelist %>%
filter(case_id %in% c("11f8ea", "76b97a", "47a5f5")) %>%
select(date_onset, date_hospitalisation))
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
# begin cleaning pipe chain
###########################
linelist <- linelist_raw %>%
# standardize column name syntax
janitor::clean_names() %>%
# manually re-name columns
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-c(row_num, merged_header, x28)) %>%
# de-duplicate
distinct() %>%
# add column
mutate(bmi = wt_kg / (ht_cm/100)^2) %>%
# convert class of columns
mutate(across(contains("date"), as.Date),
generation = as.numeric(generation),
age = as.numeric(age)) %>%
# add column: delay to hospitalisation
mutate(days_onset_hosp = as.numeric(date_hospitalisation - date_onset)) %>%
# clean values of hospital column
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Port Hopital" = "Port Hospital",
"Central Hopital" = "Central Hospital",
"other" = "Other",
"St. Marks Maternity Hopital (SMMH)" = "St. Mark's Maternity Hospital (SMMH)"
)) %>%
mutate(hospital = replace_na(hospital, "Missing")) %>%
# create age_years column (from age and age_unit)
mutate(age_years = case_when(
age_unit == "years" ~ age,
age_unit == "months" ~ age/12,
is.na(age_unit) ~ age,
TRUE ~ NA_real_)) %>%
mutate(
# age categories: custom
age_cat = epikit::age_categories(age_years, breakers = c(0, 5, 10, 15, 20, 30, 50, 70)),
# age categories: 0 to 85 by 5s
age_cat5 = epikit::age_categories(age_years, breakers = seq(0, 85, 5))) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
filter(
# keep only rows where case_id is not missing
!is.na(case_id),
# also filter to keep only the second outbreak
date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))
If you want to perform a calculation within a row, you can use rowwise() from dplyr. See the vignette on row-wise calculations.
For example, this code applies rowwise() and then creates a new column that sums the number of symptoms per case:
linelist <- linelist %>%
rowwise() %>%
mutate(num_symptoms = sum(c(fever, chills, cough, aches, vomit) == "yes"))
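A self-contained sketch of the row-wise symptom count is below, using a made-up dataframe (symptoms_demo). Two choices here are assumptions of this example rather than part of the command above: na.rm = TRUE so that a missing symptom does not make the whole count NA, and ungroup() so that later steps are no longer performed row by row.

```r
library(dplyr)

# Hypothetical symptom data with one missing value
symptoms_demo <- data.frame(
  case_id = c("a", "b", "c"),
  fever   = c("yes", "no", "yes"),
  chills  = c("yes", "no", NA),
  cough   = c("no",  "no", "yes"))

symptoms_demo <- symptoms_demo %>%
  rowwise() %>%
  # count the "yes" values within each row, ignoring missing values
  mutate(num_symptoms = sum(c(fever, chills, cough) == "yes", na.rm = TRUE)) %>%
  # return to normal column-wise behavior
  ungroup()

symptoms_demo
```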
Working with dates in R is notoriously difficult when compared to other object classes. R often interprets dates as character objects - this means they cannot be used for general date operations such as making time series and calculating time intervals. To make matters more difficult, there are many date formats, some of which can be confused for other formats. Luckily, dates can be wrangled easily with practice, and with a set of helpful packages.
Dates in R are their own class of object - the Date class. It should be noted that there is also a class that stores objects with date and time. Date time objects are formally referred to as POSIXt, POSIXct, and/or POSIXlt classes (the difference isn’t important). These objects are informally referred to as datetime classes.
You can get the system date or system datetime by doing the following:
# get the system date - this is a DATE class
Sys.Date()
## [1] "2021-02-21"
# get the system time - this is a DATETIME class
Sys.time()
## [1] "2021-02-21 15:39:43 EST"
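To check which class an object has, use class(). The short sketch below also shows a date stored as text (onset_chr is a made-up value), which must be converted before date arithmetic is possible.

```r
# Class of the system date and the system datetime
today <- Sys.Date()
now   <- Sys.time()
class(today)   # "Date"
class(now)     # "POSIXct" "POSIXt"

# A date stored as text is class character, not Date
onset_chr <- "2021-02-21"
class(onset_chr)

# After conversion, date arithmetic works: add 7 days
onset_plus_week <- as.Date(onset_chr) + 7
onset_plus_week
```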
The following packages are recommended for working with dates:
# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(aweek, # flexibly converts dates to weeks, and vice versa
lubridate, # for conversions to months, years, etc.
linelist, # function to guess messy dates
ISOweek, # another option for creating weeks
tidyverse) # data management and visualization
The standard, base R function to convert an object or variable to class Date is as.Date() (note capitalization).
as.Date() requires that the user specify the existing format of the date, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().
If used on a variable, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them or consider using guess_dates() from the linelist package.
It can be easiest to first convert the variable to character class, and then convert to date class:
as.character()
linelist_cleaned$date_of_onset <- as.character(linelist_cleaned$date_of_onset)
as.Date()
Within the as.Date() function, you must use the format= argument to tell R the current format of the date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.
For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.
linelist_cleaned$date_of_onset <- as.Date(linelist_cleaned$date_of_onset, format = "%d/%m/%Y")
TIP: The format = argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.
TIP: Be sure that in the format = argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.
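The two tips above can be checked with a short base R sketch. The `raw` vector is made-up example data. The second call deliberately uses the wrong separator to show what a format mismatch looks like in practice.

```r
# Hypothetical character dates in DD/MM/YYYY format
raw <- c("24/04/1968", "03/11/2020")

# format = describes the dates as they are NOW, including the "/" separator
parsed <- as.Date(raw, format = "%d/%m/%Y")
parsed

# A format with the wrong separator does not raise an error; it silently returns NA
wrong <- as.Date(raw, format = "%d-%m-%Y")
wrong
```

Because a mismatch produces NA rather than an error, it is worth counting missing values before and after conversion.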
Converting character objects to dates can be made far easier by using the lubridate package. The lubridate package is a tidyverse package designed to make working with dates and times simpler and more consistent than in base R. For these reasons, lubridate is often considered the gold-standard package for dates and times, and is recommended whenever working with them.
The lubridate package provides a number of different helper functions designed to convert character objects to dates in an intuitive, and more lenient way than specifying the format in as.Date(). These functions are specific to the rough date format, but allow for a variety of separators, and synonyms for dates (e.g. 01 vs Jan vs January) - they are named after abbreviations of date formats.
# load packages
library(lubridate)
# read date in year-month-day format
ymd("2020-10-11")
## [1] "2020-10-11"
ymd("20201011")
## [1] "2020-10-11"
# read date in month-day-year format
mdy("10/11/2020")
## [1] "2020-10-11"
mdy("Oct 11 20")
## [1] "2020-10-11"
# read date in day-month-year format
dmy("11 10 2020")
## [1] "2020-10-11"
dmy("11 October 2020")
## [1] "2020-10-11"
If using piping and the tidyverse, converting a character column to dates might look like this:
linelist_cleaned <- linelist_cleaned %>%
  mutate(date_of_onset = lubridate::dmy(date_of_onset))
Once complete, you can run a command to verify the class of the variable:
# Check the class of the variable
class(linelist_cleaned$date_of_onset)
Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
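Checking the class confirms the conversion ran, but not that every value was parsed. The sketch below, on a made-up vector `raw_dates`, shows one way to list the entries that lubridate could not convert. The `quiet = TRUE` argument only suppresses the parsing warning.

```r
library(lubridate)

# Hypothetical raw values: two valid DD/MM/YYYY dates, one free-text entry, one true missing
raw_dates <- c("11/10/2020", "24/04/1968", "not recorded", NA)

parsed <- dmy(raw_dates, quiet = TRUE)

# Values that were present in the raw data but became NA after parsing
failed <- raw_dates[is.na(parsed) & !is.na(raw_dates)]

class(parsed)
failed
```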
datetime classes
As previously mentioned, R also supports a datetime class - a variable that contains date and time information. As with the Date class, these often need to be converted from character objects to datetime objects.
A standard datetime object is formatted with the date first, which is followed by a time component - for example 01 Jan 2020, 16:30. As with dates, there are many ways this can be formatted, and there are numerous levels of precision (hours, minutes, seconds) that can be supplied. Luckily, lubridate helper functions also exist to help convert these strings to datetime objects. These functions are the same as the date helper functions, with _h (only hours supplied), _hm (hours and minutes supplied), or _hms (hours, minutes, and seconds supplied) appended to the end (e.g. dmy_hms()). These can be used as shown:
# convert datetime with only hours to datetime object
ymd_h("2020-01-01 16hrs")
## [1] "2020-01-01 16:00:00 UTC"
ymd_h("2020-01-01 4PM")
## [1] "2020-01-01 16:00:00 UTC"
# convert datetime with hours and minutes to datetime object
# (the helper name must match the order of the date parts: here month, day, year)
mdy_hm("Jan 1st 2020 16:20")
# convert datetime with hours, minutes, and seconds to datetime object
# (here the order is day, month, year)
dmy_hms("01 January 20, 16:20:40")
# you can supply time zone but it is ignored
dmy_hms("01 January 20, 16:20:40 PST")
When working with a linelist, time and date columns can be combined to create a datetime column using these functions:
# time_admission is a variable in hours:minutes
linelist_cleaned <- linelist_cleaned %>%
  # assume that when time of admission is not given, it is the median admission time
  mutate(
    time_admission_clean = ifelse(
      is.na(time_admission),
      median(time_admission, na.rm = TRUE),
      time_admission
    )
  ) %>%
  # use paste() to combine two columns to create a character vector, and use ymd_hm() to convert to datetime
  mutate(
    date_time_of_admission = paste(
      date_hospitalisation, time_admission_clean, sep = " "
    ) %>% ymd_hm()
  )
lubridate can also be used for a variety of other functions, such as extracting aspects of a date/datetime, performing date arithmetic, or calculating date intervals:
# create an example date
example_date <- ymd("2020-03-01")
# extract the month and year from this date
month(example_date)
## [1] 3
year(example_date)
## [1] 2020
# get the epiweek of this date (this will be expanded later)
epiweek(example_date)
## [1] 10
# get the day of the week for this date (this will be expanded later)
wday(example_date)
## [1] 1
# add 3 days to this date
example_date + days(3)
## [1] "2020-03-04"
# add 7 weeks and subtract two days from this date
example_date + weeks(7) - days(2)
## [1] "2020-04-17"
# find the difference between this date and Feb 20 2020
example_date - ymd("2020-02-20")
## Time difference of 10 days
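Adding days and weeks is unambiguous, but adding months can land on a date that does not exist. The sketch below is an aside rather than part of the original examples. It uses lubridate's `%m+%` operator, which rolls back to the last valid day of the month.

```r
library(lubridate)

jan31 <- ymd("2020-01-31")

# Plain addition asks for "31 February", which does not exist, so the result is NA
plain_add <- jan31 + months(1)

# %m+% rolls back to the last real day of the target month (2020 is a leap year)
rolled_add <- jan31 %m+% months(1)

plain_add
rolled_add
```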
This can all be brought together to work with data - for example:
library(lubridate)
linelist_cleaned <- linelist_cleaned %>%
# convert date of onset from character to date objects by specifying dmy format
mutate(date_of_onset = dmy(date_of_onset),
date_of_hospitalisation = dmy(date_of_hospitalisation)) %>%
  # keep only cases with onset in March
  filter(month(date_of_onset) == 3) %>%
  # find the difference in days between onset and hospitalisation
  mutate(onset_to_hosp_days = date_of_hospitalisation - date_of_onset)
guess_dates()
The function guess_dates() attempts to read a “messy” date variable containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(), which is in the linelist package.
For example:
guess_dates() would see the following dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as: 2018-01-03, 1982-03-07, and 1985-08-20.
linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # guess_dates() not yet available on CRAN for R 4.0.2
# try install via devtools::install_github("reconhub/linelist")
Some optional arguments for guess_dates() that you might include are:
error_tolerance - the proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
last_date - the last valid date (defaults to current date)
first_date - the first valid date (defaults to fifty years before the last_date)
# An example using guess_dates on the variable dtdeath
data_cleaned <- data %>%
  mutate(
    dtdeath = linelist::guess_dates(
      dtdeath, error_tolerance = 0.1, first_date = "2016-01-01"
    )
  )
Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”… use the as.Date() or as_date() function to convert, but instead of supplying a format as above, supply an origin date. This will not work if the Excel date is read as a character type, so be sure the date is a numeric class (or convert it to one)!
NOTE: You should provide the origin date in R’s default date format ("YYYY-MM-DD").
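As a minimal base R sketch of the point above, the example below starts from Excel serial numbers that were read in as text (the values are made up for illustration), converts them to numeric, and then supplies the origin date.

```r
# Hypothetical Excel serial dates that were imported as character values
excel_serial_chr <- c("41369", "41370", NA)

# Convert to numeric first; the origin-based conversion needs numbers, not text
excel_serial_num <- as.numeric(excel_serial_chr)

# Count days forward from Excel's origin date
converted <- as.Date(excel_serial_num, origin = "1899-12-30")
format(converted)   # "2013-04-05" "2013-04-06" NA
```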
library(lubridate)
library(dplyr)
# An example of providing the Excel 'origin date' when converting Excel number dates
data_cleaned <- data %>%
  mutate(date_of_onset = as_date(as.double(date_of_onset), origin = "1899-12-30"))
Once dates are the correct class, you often want them to display differently (e.g. in a plot, graph, or table). For example, to display as “Friday 05 Jan” instead of 2018-01-05. You can do this with the function format(), which works in a similar way to as.Date(). Read more in this online tutorial. Remember that the output from format() is a character type, so it is generally used for display purposes only!
%d = Day # (of the month e.g. 16, 17, 18…)
%a = abbreviated weekday (Mon, Tue, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = # of month (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = Full Month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = Time zone (character)
An example of formatting today’s date:
# today's date, with formatting
format(Sys.Date(), format = "%d %B %Y")
## [1] "21 February 2021"
# easy way to get full date and time (no formatting)
date()
## [1] "Sun Feb 21 15:39:56 2021"
# formatted date, time, and time zone (using paste0() function)
paste0(
  format(Sys.Date(), format = "%A, %b %d '%y, %z %Z, "),
  format(Sys.time(), format = "%H:%M:%S")
)
## [1] "Sunday, Feb 21 '21, +0000 UTC, 15:39:57"
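The warning that format() returns character values matters most when sorting. The short base R sketch below uses made-up dates to show that formatted labels sort alphabetically rather than chronologically, which is why the Date column itself should be kept for ordering and calculations.

```r
onset <- as.Date("2018-01-05")

label_numeric <- format(onset, "%d/%m/%Y")   # "05/01/2018"
label_words   <- format(onset, "%A %d %b")   # weekday and month names follow your locale

class(label_numeric)                          # "character", no longer a Date

# Two hypothetical dates: 15 January comes before 1 February...
two_dates <- as.Date(c("2018-02-01", "2018-01-15"))
sorted_dates  <- sort(two_dates)

# ...but the formatted labels sort as text, putting "01/02/2018" first
sorted_labels <- sort(format(two_dates, "%d/%m/%Y"))
sorted_labels
```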
The difference between dates can be calculated by:
# define variables as date classes
date_of_onset <- ymd("2020-03-16")
date_lab_confirmation <- ymd("2020-03-20")
# find the delay between onset and lab confirmation
days_to_lab_conf <- as.double(date_lab_confirmation - date_of_onset)
days_to_lab_conf
## [1] 4
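Subtracting two dates returns a difftime object whose units R chooses automatically. As a hedged alternative to the subtraction above, base R's difftime() lets you state the units explicitly before converting to a number.

```r
date_of_onset         <- as.Date("2020-03-16")
date_lab_confirmation <- as.Date("2020-03-20")

# State the units explicitly rather than relying on the default
delay <- difftime(date_lab_confirmation, date_of_onset, units = "days")

delay_days  <- as.numeric(delay)                    # 4
delay_weeks <- as.numeric(delay, units = "weeks")   # 4/7, i.e. about 0.57
```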
In a dataframe format (i.e. when working with a linelist), if either of the above dates is missing, the operation will fail for that row. This will result in an NA instead of a numeric value. When using this column for calculations, be sure to set the na.rm option to TRUE. For example:
# add a new column
# calculating the number of days between symptom onset and patient outcome
linelist_delay <- linelist_cleaned %>%
mutate(
days_onset_to_outcome = as.double(date_of_outcome - date_of_onset)
)
# calculate the median number of days to outcome for all cases where data are available
med_days_outcome <- median(linelist_delay$days_onset_to_outcome, na.rm = TRUE)
# often this operation might be done only on a subset of data cases, e.g. those who died
# this is easy to look at and will be explained later in the handbook
When data are recorded in different time zones, it can often be important to standardise them to a single time zone. This can present a further challenge, as the time zone component of data must be coded manually in most cases.
In R, each datetime object has a timezone component. By default, all datetime objects will carry the local time zone for the computer being used - this is generally specific to a location rather than a named timezone, as time zones will often change in locations due to daylight savings time. It is not possible to accurately compensate for time zones without a time component of a date, as the event a date variable represents cannot be attributed to a specific time, and therefore time shifts measured in hours cannot be reasonably accounted for.
To deal with time zones, there are a number of helper functions in lubridate that can be used to change the time zone of a datetime object from the local time zone to a different time zone. Time zones are set by attributing a valid tz database time zone to the datetime object. A list of these can be found here - if the location you are using data from is not on this list, nearby large cities in the time zone are available and serve the same purpose.
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
# assign the current time to a variable
time_now <- Sys.time()
time_now
## [1] "2021-02-21 15:39:57 EST"
# use with_tz() to assign a new timezone to the variable, while CHANGING the clock time
time_london_real <- with_tz(time_now, "Europe/London")
# use force_tz() to assign a new timezone to the variable, while KEEPING the clock time
time_london_local <- force_tz(time_now, "Europe/London")
# note that as long as the computer that was used to run this code is NOT set to London time, there will be a difference in the times (the number of hours difference from the computers time zone to london)
time_london_real - time_london_local
## Time difference of 5 hours
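Because the result above depends on the time zone of the computer that runs it, a reproducible variant may help. The sketch below fixes the starting instant in UTC (the timestamp is arbitrary) and then applies the same two lubridate functions.

```r
library(lubridate)

# A fixed instant, so the result does not depend on the computer's own time zone
t_utc <- ymd_hms("2021-02-21 20:39:57", tz = "UTC")

# with_tz(): same moment in time, shown on a New York clock (15:39:57)
same_instant <- with_tz(t_utc, "America/New_York")

# force_tz(): same clock reading (20:39:57), now declared to be New York time,
# which is a different moment in time
same_clock <- force_tz(t_utc, "America/New_York")

same_instant == t_utc                          # TRUE
difftime(same_clock, t_utc, units = "hours")   # 5 hours
```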
This may seem largely abstract, and is often not needed if the user isn’t working across time zones. One simple example of its implementation is:
# TODO add when time variable is here
# set the time variable to time zone for ebola outbreak
# "Africa/Lubumbashi" is the time zone for eastern DRC/Kivu Nord
Use the floor_date() function from lubridate, with unit = "week". See example below for specifying the week start day. The returned output is the start date of the week, in Date class.
For example, you can create a new column of weeks, then use group_by() with summarize() to get weekly case counts.
To aggregate into weeks and show ALL weeks (even ones with no cases), do this:
Create a new week column within mutate(), using floor_date() from the lubridate package:
unit = to set the desired time unit, e.g. "week"
week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
Follow with complete() to ensure that all weeks appear - even those with no cases.
For example:
# Make dataset of weekly case counts
weekly_counts <- linelist %>%
  mutate(
    week = lubridate::floor_date(date_onset,
                                 unit = "week")) %>%  # new column of week of onset
  count(week) %>%                                  # group data by week and count rows per group
  filter(!is.na(week)) %>%                         # remove entries for cases missing date_onset
  complete(week = seq.Date(from = min(week),       # fill-in all weeks with no cases reported
                           to = max(week),
                           by = "week"),
           fill = list(n = 0))                     # and set their count to 0 rather than NA
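The same steps can be tried on a tiny made-up dataset, which makes the effect of complete() easy to see. In this sketch the `toy` data, the Monday week start, and the zero fill are all illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical onset dates: two cases in one week, none the next, one the week after
toy <- tibble(date_onset = as.Date(c("2020-03-02", "2020-03-03", "2020-03-17", NA)))

weekly <- toy %>%
  filter(!is.na(date_onset)) %>%                            # drop cases missing onset date
  mutate(week = floor_date(date_onset, unit = "week",
                           week_start = 1)) %>%             # weeks start on Monday
  count(week) %>%                                           # cases per week
  complete(week = seq.Date(min(week), max(week), by = "week"),
           fill = list(n = 0L))                             # add empty weeks with a count of 0

weekly
```

The week of 2020-03-09 has no cases in the raw data, yet appears in the result with a count of 0.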
You can also use the package aweek to set epidemiological weeks. You can read more about it on the RECON website
See the section on epicurves.
lead() and lag() are functions from the dplyr package which help find previous (lagged) or subsequent (leading) values in a vector - typically a numeric or date vector. This is useful when doing calculations of change/difference between time units.
Let’s say you want to calculate the difference in cases between a current week and the previous one. The data are initially provided in weekly counts as shown below. To learn how to aggregate counts from daily to weekly see the page on aggregating (LINK).
When using lag() or lead() the order of rows in the dataframe is very important! - pay attention to whether your dates/numbers are ascending or descending
First, create a new column containing the value of the previous (lagged) week.
n = to set how many rows back (or forward) to look (must be a non-negative integer)
default = to define the value placed in non-existing rows (e.g. the first row for which there is no lagged value). By default this is NA.
order_by = to supply a column to order by, if your rows are not already sorted by your reference column
counts <- counts %>%
  mutate(cases_prev_wk = lag(cases_wk, n = 1))
Next, create a new column which is the difference between the two cases columns:
counts <- counts %>%
  mutate(cases_prev_wk = lag(cases_wk, n = 1),
         case_diff = cases_wk - cases_prev_wk)
You can read more about lead() and lag() in the documentation here or by entering ?lag in your console.
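A self-contained sketch of the week-on-week difference is shown here on made-up weekly counts (the `counts` values are hypothetical). It sorts by week first, since lag() works on row order.

```r
library(dplyr)

# Hypothetical weekly case counts
counts <- tibble(
  week     = as.Date("2020-03-02") + 7 * 0:3,
  cases_wk = c(5, 8, 6, 10)
)

counts <- counts %>%
  arrange(week) %>%                                   # lag() depends on row order
  mutate(cases_prev_wk = lag(cases_wk, n = 1),        # previous week's count (NA in first row)
         case_diff     = cases_wk - cases_prev_wk)    # change from the previous week

counts$case_diff   # NA  3 -2  4
```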
Sys.Date() returns the current date of your computer
Sys.time() returns the current time of your computer
date() returns the current date and time.
This page will cover:
NA is displayed in plots
Load packages
pacman::p_load(
tidyverse,
rio
)
Load data
linelist <- rio::import("linelist_cleaned.rds")
The following are useful functions when assessing or handling missing values:
is.na() and !is.na()
To identify missing values use is.na() or its opposite (with ! in front). Both are from base R.
These return a logical vector (TRUE or FALSE). Remember that you can sum() the resulting vector to count the number of TRUE values, e.g. sum(is.na(linelist$date_outcome)).
my_vector <- c(1, 4, 56, NA, 5, NA, 22)
is.na(my_vector)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
!is.na(my_vector)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
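The same idea extends from one vector to a whole data frame. As a small base R sketch on a made-up data frame, colSums() applied to is.na() counts the missing values in every column at once.

```r
# Hypothetical data frame with some missing values
toy <- data.frame(
  age     = c(3, NA, 40),
  outcome = c(NA, NA, "Death")
)

# is.na() on a data frame returns a TRUE/FALSE matrix; colSums() counts TRUE per column
missing_per_column <- colSums(is.na(toy))
missing_per_column   # age: 1, outcome: 2
```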
na.omit()
This function, if applied to a dataframe, will remove rows with any missing values. It is also from base R.
If applied to a vector, it will remove NA values from the vector it is applied to. For example:
sum(na.omit(my_vector))
## [1] 88
na.rm = TRUE
Often a mathematical function will by default include NA in calculations, which results in the function returning NA (this is designed intentionally, to make you aware that you have missing data).
You can usually avoid this by removing missing values from the calculation, by including the argument na.rm = TRUE (na.rm stands for “remove NA”).
mean(my_vector)
## [1] NA
mean(my_vector, na.rm = TRUE)
## [1] 17.6
You can use the package naniar to assess and visualize missingness.
pacman::p_load(naniar)
Some basic missingness functions from naniar include:
The percent of all values that are missing
pct_miss(linelist) # also see n_miss() for counts
## [1] 6.616847826086956
These two functions return the percent of rows with any missing values, or that are entirely complete, respectively. Note that "" or " " will register as non-missing.
pct_miss_case(linelist) # also see n_case_miss() for counts
## [1] 68.63111413043478
pct_complete_case(linelist) # also see n_case_complete() for counts
## [1] 31.36888586956522
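If naniar is not available, the same three percentages can be approximated in base R. This is a sketch on a made-up data frame, not a description of how naniar computes them internally. It also shows that an empty string is counted as present, not missing.

```r
# Hypothetical data: one NA in each column, plus an empty string in column b
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "", NA, "y")
)

pct_values_missing   <- mean(is.na(toy)) * 100             # 2 of 8 values  -> 25
pct_rows_any_missing <- mean(!complete.cases(toy)) * 100   # rows 2 and 3   -> 50
pct_rows_complete    <- mean(complete.cases(toy)) * 100    # rows 1 and 4   -> 50

# The empty string in row 2 of column b is NOT counted as missing
is.na(toy$b)
```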
The gg_miss_var() function will tell you the number missing in each column. You can add a bare column name to the argument facet = if desired to see the plot by groups. By default, counts are shown instead of percents (show_pct = FALSE). You can also add labels as with a normal ggplot, using + labs().
gg_miss_var(linelist, show_pct = TRUE)
You can use vis_miss() to visualize the dataframe as a heatmap, showing whether each value is missing or not:
vis_miss(linelist)
How do you visualize something that is not there? By default, ggplot removes points with missing values from plots.
naniar offers a solution via geom_miss_point(). When creating a scatterplot of two columns, records with one of the values missing and the other present are shown by setting the missing values to 10% lower than the lowest value in the column, and coloring them distinctly.
In the scatterplot below, the red dots are records where the value for one column is present but the value for the other column is missing.
ggplot(
linelist,
aes(x = age_years,
y = temp)) + # column to show missingness
  geom_miss_point()
To assess missingness in the dataframe by another column, consider gg_miss_fct(), which returns a heatmap of percent missingness in the dataframe by a factor/categorical (or date) column:
gg_miss_fct(linelist, age_cat5)
This function can also be used on a date column to neat effect:
gg_miss_fct(linelist, date_onset)
## Warning: Removed 29 rows containing missing values (geom_tile).
“Shadow” columns
Another way to visualize missingness in one column by values in a second column is using the “shadow” that naniar can create. Essentially, bind_shadow() creates a binary NA/not NA column for every column, and adds all these columns to the dataset (doubling the number of columns). See below:
shadowed_linelist <- linelist %>%
bind_shadow()
names(shadowed_linelist)
## [1] "case_id" "generation" "date_infection" "date_onset" "date_hospitalisation"
## [6] "date_outcome" "outcome" "gender" "age" "age_unit"
## [11] "age_years" "age_cat" "age_cat5" "hospital" "lon"
## [16] "lat" "infector" "source" "wt_kg" "ht_cm"
## [21] "ct_blood" "fever" "chills" "cough" "aches"
## [26] "vomit" "temp" "time_admission" "bmi" "days_onset_hosp"
## [31] "case_id_NA" "generation_NA" "date_infection_NA" "date_onset_NA" "date_hospitalisation_NA"
## [36] "date_outcome_NA" "outcome_NA" "gender_NA" "age_NA" "age_unit_NA"
## [41] "age_years_NA" "age_cat_NA" "age_cat5_NA" "hospital_NA" "lon_NA"
## [46] "lat_NA" "infector_NA" "source_NA" "wt_kg_NA" "ht_cm_NA"
## [51] "ct_blood_NA" "fever_NA" "chills_NA" "cough_NA" "aches_NA"
## [56] "vomit_NA" "temp_NA" "time_admission_NA" "bmi_NA" "days_onset_hosp_NA"
These “shadow” columns can be used to plot the proportion of values that are missing, by another column X. For example, the plot below shows the proportion of records missing age_years, by that record’s value in date_hospitalisation. Essentially, you plot the density of the x-axis column, but stratify the results (colour =) by a shadow column of interest. This analysis works best if the x-axis is a numeric or date column.
ggplot(
  shadowed_linelist, # dataframe with shadow columns
  aes(x = date_hospitalisation, # numeric or date column
      colour = age_years_NA)) + # shadow column of interest
  geom_density() # plots the density curves
You can also use these “shadow” columns to stratify a statistical summary, as shown below:
linelist %>%
  bind_shadow() %>% # create the shadow cols
  group_by(date_outcome_NA) %>% # shadow col for stratifying
  summarise_at(.vars = c("age_years"), # variable of interest for calculations
               .funs = c("mean", "sd", "var", "min", "max"), # stats to calculate
               na.rm = TRUE) # other arguments for the stat calculations
## # A tibble: 2 x 6
## date_outcome_NA mean sd var min max
## * <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 !NA 16.0 12.6 158. 0 77
## 2 NA 16.3 12.6 158. 0 75
An alternative way to plot the proportion of values in one column that are missing, without using naniar, is given below. This example shows the percent of weekly observations that are missing in a column:
Aggregate the data into a useful time unit (e.g. weeks), summarizing the proportion of observations with NA (and any other values of interest)
Plot the proportion missing as a line using ggplot()
Below, we take the linelist, add a new column for week, group the data by week, and then calculate the percent of that week’s records where the value is missing. (note: if you want % of 7 days the calculation would be slightly different).
outcome_missing <- linelist %>%
mutate(week = lubridate::floor_date(date_onset, "week")) %>% # create new week column
group_by(week) %>% # group the rows by week
summarize( # summarize each week
n_obs = n(), # number of records
outcome_missing = sum(is.na(outcome) | outcome == ""), # number of records missing the value
outcome_p_miss = outcome_missing / n_obs, # proportion of records missing the value
    outcome_dead = sum(outcome == "Death", na.rm = TRUE), # number of records as dead
    outcome_p_dead = outcome_dead / n_obs) %>% # proportion of records as dead
  tidyr::pivot_longer(-week, names_to = "statistic") %>% # pivot all columns except week, to long format for ggplot
  filter(stringr::str_detect(statistic, "_p_")) # keep only the proportion values
Then we plot the proportion missing as a line, by week:
ggplot(data = outcome_missing)+
geom_line(
aes(x = week, y = value, group = statistic, color = statistic),
size = 2,
stat = "identity")+
labs(title = "Weekly outcomes",
x = "Week",
y = "Proportion of weekly records") +
scale_color_discrete(
name = "",
labels = c("Died", "Missing outcome"))+
scale_y_continuous(breaks = c(seq(0,1,0.1)))+
theme_minimal()+
theme(
legend.position = "bottom"
  )
## Warning: Removed 2 row(s) containing missing values (geom_path).
To quickly remove rows with missing values, use the tidyr function drop_na().
The original linelist has nrow(linelist) rows. The adjusted number of rows is shown below:
linelist %>%
  drop_na() %>% # remove rows with ANY missing values
  nrow()
Additionally, you can specify columns to evaluate for missingness:
linelist %>%
drop_na(date_onset) %>% # remove rows missing date_onset
  nrow()
## [1] 5647
Multiple columns can be specified one after the other, or using tidyselect helper functions:
linelist %>%
  drop_na(contains("date")) %>% # remove rows missing values in any "date" column
  nrow()
## [1] 3053
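The three variations of drop_na() can be compared side by side on a tiny made-up data frame, where the row counts are easy to verify by eye. The `toy` data here is purely illustrative.

```r
library(tidyr)

# Hypothetical data: only the first row is fully complete
toy <- data.frame(
  id         = 1:4,
  date_onset = as.Date(c("2020-01-01", NA, "2020-01-03", NA)),
  outcome    = c("Death", "Recover", NA, NA)
)

n_complete_rows <- nrow(drop_na(toy))                    # drop rows with ANY missing value
n_with_onset    <- nrow(drop_na(toy, date_onset))        # only check date_onset
n_with_dates    <- nrow(drop_na(toy, contains("date")))  # check every column with "date" in its name

c(n_complete_rows, n_with_onset, n_with_dates)   # 1 2 2
```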
NA in ggplot()
It is often wise to report the number of values excluded from a plot in a caption.
In ggplot(), you can add labs() and within it a caption =. In the caption, you can use str_glue() from the stringr package to paste values together into a sentence dynamically, so they will adjust to the data. An example is below. Use \n for a new line.
labs(
  title = "Weekly case incidence, by gender",
  y = "Weekly case incidence",
  x = "Week of symptom onset",
  caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; {nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown."))
Sometimes, it can be easier to save the value as an object in commands before the ggplot() command, and simply reference the named value within the str_glue().
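A minimal sketch of that approach follows. The two counts are hard-coded, hypothetical numbers standing in for values you would calculate from your own data, and the caption uses \n to break onto a second line.

```r
library(stringr)

# Hypothetical counts; in practice these would be calculated from the data
n_total   <- 120
n_missing <- 7

# Build the caption once, before the ggplot() command
caption_text <- str_glue(
  "n = {n_total} from Central Hospital;\n{n_missing} cases missing date of onset and not shown."
)

caption_text
# then, inside the plot: labs(caption = caption_text)
```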
Sometimes, when analyzing your data, it will be important to “fill in the gaps” and impute missing data. While you can always simply analyze a dataset after removing all missing values, this can cause problems in many ways. Here are two examples:
By removing all observations with missing values or variables with a large amount of missing data, you might reduce your power or ability to do some types of analysis. For example, as we discovered earlier, only about 31.4% of the observations in our linelist dataset have no missing data across all of our variables. If we removed the majority of our dataset we’d be losing a lot of information! And most of our variables have some amount of missing data, so for most analyses it’s probably not reasonable to drop every variable that has a lot of missing data either.
Depending on why your data is missing, analysis of only non-missing data might lead to biased or misleading results. For example, as we learned earlier we are missing data for some patients about whether they’ve had some important symptoms like fever or cough. But, as one possibility, maybe that information wasn’t recorded for people that just obviously weren’t very sick. In that case, if we just removed these observations we’d be excluding some of the healthiest people in our dataset and that might really bias any results.
It’s important to think about why your data might be missing in addition to seeing how much is missing. Doing this can help you decide how important it might be to impute missing data, and also which method of imputing missing data might be best in your situation.
Here are three general types of missing data:
Missing Completely at Random (MCAR). This means that there is no relationship between the probability of data being missing and any of the other variables in your data. The probability of being missing is the same for all cases. This is a rare situation. But, if you have strong reason to believe your data is MCAR, analyzing only non-missing data without imputing won’t bias your results (although you may lose some power).
Missing at Random (MAR). This name is actually a bit misleading, as MAR means that your data is missing in a systematic, predictable way based on the other information you have. For example, maybe every observation in our dataset with a missing value for fever was actually not recorded because every patient with chills and aches was just assumed to have a fever, so their temperature was never taken. If true, we could easily predict that every missing observation with chills and aches has a fever as well and use this information to impute our missing data. In practice, this is more of a spectrum. Maybe if a patient had both chills and aches they were more likely to have a fever as well if they didn’t have their temperature taken, but not always. This is still predictable even if it isn’t perfectly predictable. This is a common type of missing data.
Missing not at Random (MNAR). Sometimes, this is also called Not Missing at Random (NMAR). This assumes that the probability of a value being missing is NOT systematic or predictable using the other information we have but also isn’t missing randomly. In this situation data is missing for unknown reasons or for reasons you don’t have any information about. For example, in our dataset maybe information on age is missing because some very elderly patients either don’t know or refuse to say how old they are. In this situation, missing data on age is related to the value itself (and thus isn’t random) and isn’t predictable based on the other information we have. MNAR is complex and often the best way of dealing with this is to try to collect more data or information about why the data is missing rather than attempt to impute it.
In general, imputing MCAR data is often fairly simple, while MNAR is very challenging if not impossible. Many of the common data imputation methods assume MAR.
Some useful packages for imputing missing data are Hmisc, missForest (which uses random forests to impute missing data), and mice (Multivariate Imputation by Chained Equations). For this section we’ll just use the mice package, which implements a variety of techniques. The maintainer of the mice package has published an online book about imputing missing data that goes into more detail here (https://stefvanbuuren.name/fimd/).
Here is the code to load the mice package:
pacman::p_load(mice)
Sometimes if you are doing a simple analysis or you have strong reason to think you can assume MCAR, you can simply set missing numerical values to the mean of that variable. Perhaps we can assume that missing temperature measurements in our dataset were either MCAR or were just normal values. Here is the code to create a new variable that replaces missing temperature values with the mean temperature value in our dataset. However, in many situations replacing data with the mean can lead to bias, so be careful.
linelist = linelist %>% mutate(temp_replace_na_with_mean = replace_na(temp, mean(temp, na.rm = TRUE)))
You could also do a similar process for replacing categorical data with a specific value. For our dataset, imagine you knew that all observations with a missing value for their outcome (which can be “Death” or “Recover”) were actually people that died (note: this is not actually true for this dataset):
linelist = linelist %>% mutate(outcome_replace_na_with_death =
                                 replace_na(outcome, "Death"))
A somewhat more advanced method is to use some sort of statistical model to predict what a missing value is likely to be and replace it with the predicted value. Here is an example of creating predicted values for all the observations where temperature is missing but age and fever are not, using simple linear regression with fever status and age in years as predictors. In practice you’d want to use a better model than this sort of simple approach.
simple_temperature_model_fit = lm(temp ~ fever + age_years, data = linelist)
predictions_for_missing_temps = predict(simple_temperature_model_fit,
                                        newdata = linelist %>% filter(is.na(temp))) # using our simple temperature model to predict values just for the observations where temp is missing
Or, using the same modeling approach through the mice package to create imputed values for the missing temperature observations:
model_dataset = linelist %>%
select(temp, fever, age_years)
temp_imputed_values = mice(model_dataset, method = "norm.predict", seed = 1, m = 1, print = F)$imp$temp
## Warning: Number of logged events: 1
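The regression example above stops at computing the predictions. As a hedged base R sketch of the remaining step, the code below fits the same kind of model on a small made-up data frame and writes the predictions back into a new column, leaving observed values untouched. All values and the column name `temp_imputed` are hypothetical.

```r
# Hypothetical data: temperature is missing for two patients
toy <- data.frame(
  temp      = c(36.4, 38.9, NA, 36.6, 39.2, NA),
  fever     = c("no", "yes", "yes", "no", "yes", "no"),
  age_years = c(10, 25, 31, 44, 52, 60)
)

# Fit the model on rows with observed temperature (lm() drops NA rows by default)
fit <- lm(temp ~ fever + age_years, data = toy)

# Predict only for the rows where temperature is missing
missing_temp <- is.na(toy$temp)
toy$temp_imputed <- toy$temp
toy$temp_imputed[missing_temp] <- predict(fit, newdata = toy[missing_temp, ])

toy
```

Keeping the original `temp` column alongside `temp_imputed` makes it easy to see later which values were observed and which were filled in.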
This is the same type of approach used by some more advanced methods, like using the missForest package to replace missing data with predicted values. In that case, the prediction model is a random forest instead of a linear regression. You can use other types of models to do this as well. However, while this approach works well under MCAR, you should be a bit careful if you believe MAR or MNAR more accurately describes your situation. The quality of your imputation will depend on how good your prediction model is, and even with a very good model the variability of your imputed data may be underestimated.
Last observation carried forward (LOCF) and baseline observation carried forward (BOCF) are imputation methods for time series/longitudinal data. The idea is to take the previous observed value as a replacement for the missing data. When multiple values are missing in succession, the method searches for the last observed value.
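As a minimal sketch of LOCF, the example below uses fill() from the tidyr package on a small made-up dataset. The dataframe temps and its columns patient, day, and temp are hypothetical and are not part of the linelist.

```r
library(dplyr)
library(tidyr)

# Hypothetical longitudinal data: repeated temperature readings per patient
temps <- tibble(
  patient = c("A", "A", "A", "A", "B", "B", "B"),
  day     = c(1, 2, 3, 4, 1, 2, 3),
  temp    = c(38.1, NA, NA, 37.2, NA, 39.0, NA)
)

temps_locf <- temps %>%
  arrange(patient, day) %>%              # order observations in time within each patient
  group_by(patient) %>%                  # do not carry values from one patient to the next
  fill(temp, .direction = "down") %>%    # LOCF: replace NA with the last observed value
  ungroup()
```

Note that the first reading for patient B stays missing, because there is no earlier observation to carry forward.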
[TO BE COMPLETED]
The online book we mentioned earlier by the author of the mice package (https://stefvanbuuren.name/fimd/) contains a detailed explanation of multiple imputation and why you’d want to use it. But, here is a basic explanation of the method:
When you do multiple imputation, you create multiple datasets with the missing values imputed to plausible data values (depending on your research data you might want to create more or fewer of these imputed datasets, but the mice package sets the default number to 5). The difference from single imputation is that, rather than a single specific value, each imputed value is drawn from an estimated distribution (so it includes some randomness). As a result, each of these datasets will have slightly different imputed values (however, the non-missing data will be the same in each of these imputed datasets). You still use some sort of predictive model to do the imputation in each of these new datasets (mice has many options for prediction methods including Predictive Mean Matching, logistic regression, and random forest), but the mice package can take care of many of the modeling details.
Then, once you have created these new imputed datasets, you can apply whatever statistical model or analysis you were planning to each of them and pool the results of these models together. This works very well to reduce bias in both MCAR and many MAR settings and often results in more accurate standard error estimates.
Here is an example of applying the Multiple Imputation process to predict temperature in our linelist dataset using age and fever status (our simplified model_dataset from above):
# impute missing values for all variables in our model_dataset, creating 10 new imputed datasets
multiple_imputation = mice(model_dataset, seed = 1, m = 10, print = FALSE)
## Warning: Number of logged events: 1
model_fit <- with(multiple_imputation, lm(temp ~ age_years + fever))
base::summary(mice::pool(model_fit))
## term estimate std.error statistic df p.value
## 1 (Intercept) 3.699593961153406e+01 0.0205454579070129262 1800.687031604487629 135.5811552330955 0.0000000000000000
## 2 age_years 8.060782457373544e-04 0.0006190182408225061 1.302188195078544 142.8000092289232 0.1949484930950736
## 3 feveryes 2.010504661283107e+00 0.0181511353498930092 110.764677940377865 612.2380050678109 0.0000000000000000
Here we used the mice default method of imputation, which is Predictive Mean Matching. We then used these imputed datasets to separately estimate and then pool results from simple linear regressions on each of these datasets. There are many details we’ve glossed over and many settings you can adjust during the Multiple Imputation process while using the mice package. For example, you won’t always have numerical data and might need to use other imputation methods (you can still use the mice package for many other types of data and methods). But, for a more robust analysis when missing data is a significant concern, Multiple Imputation is a good solution that isn’t always much more work than doing a complete case analysis.
This page reviews how to group and aggregate data for descriptive analysis. It makes use of tidyverse packages for common and easy-to-use functions.
Grouping data is a core component of data management and analysis. Grouped data can be plotted, or summarised by group (whether by time period, place, or a relevant categorical variable). Functions from the dplyr package (part of the tidyverse) make grouping and subsequent operations quite easy.
This page shows how to perform these grouping operations using:
The group_by() command in dplyr
The aggregate() command as an alternative
Load packages
Ensure the tidyverse package is installed and loaded (it includes dplyr).
pacman::p_load(rio, # to import data
here, # to locate files
tidyverse, # to clean, handle, and plot the data (includes dplyr)
janitor # adding total rows and columns
)
Load data
For this page we use the cleaned linelist dataset.
linelist <- rio::import(here("data", "linelist_cleaned.xlsx"))
The function group_by() from dplyr groups the rows by the unique values in the specified columns. Each unique value constitutes a group (or each unique combination of values, if multiple grouping columns are specified). Subsequent changes to the dataset or calculations can then be performed within the context of each unique group.
For example, the command below takes the linelist and groups the rows by unique values in column outcome, saving the output as a new dataframe ll_by_outcome. The column name is placed inside the parentheses of the function group_by().
ll_by_outcome <- linelist %>%
group_by(outcome)
Note that there is no perceptible change to the dataset after group_by(), until another dplyr verb such as mutate() or summarise() is applied on the “grouped” dataframe.
You can however “see” the groupings by printing the dataframe. When you print a grouped dataframe, you will see it has been transformed into a tibble class object (LINK) which, when printed, displays which grouping columns have been applied and how many groups there are - written just above the header row.
# print to see which groups are active
ll_by_outcome
## # A tibble: 5,888 x 30
## # Groups: outcome [3]
## case_id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat
## <chr> <dbl> <date> <date> <date> <date> <chr> <chr> <dbl> <chr> <dbl> <fct> <fct> <chr> <dbl> <dbl>
## 1 d8a13d 4 2014-05-06 2014-05-08 2014-05-10 NA <NA> f 3 years 3 0-4 0-4 St. Mar~ -13.2 8.46
## 2 8689b7 4 NA 2014-05-13 2014-05-14 2014-05-18 Recover f 7 years 7 5-9 5-9 Missing -13.2 8.45
## 3 11f8ea 2 NA 2014-05-16 2014-05-18 2014-05-30 Recover m 21 years 21 20-29 20-24 St. Mar~ -13.2 8.46
## 4 dae8c7 3 2014-05-23 NA 2014-05-27 2014-05-30 Death f 4 years 4 0-4 0-4 Port Ho~ -13.2 8.45
## 5 acf422 6 2014-05-25 2014-05-27 2014-05-28 2014-06-27 Recover m 4 years 4 0-4 0-4 Central~ -13.3 8.48
## 6 1a4ac9 6 NA 2014-05-27 2014-05-29 2014-06-07 Death m 30 years 30 30-49 30-34 Port Ho~ -13.3 8.45
## 7 275cc7 5 2014-05-24 2014-05-27 2014-05-28 2014-06-07 Death f 13 years 13 10-14 10-14 Central~ -13.2 8.47
## 8 1389ca 4 NA 2014-06-05 2014-06-07 2014-06-09 Death f 2 years 2 0-4 0-4 Missing -13.3 8.47
## 9 057e7a 7 2014-06-04 2014-06-14 2014-06-15 NA Recover f 4 years 4 0-4 0-4 Missing -13.2 8.47
## 10 c97dd9 9 NA NA 2014-06-19 2014-07-11 Recover m 22 years 22 20-29 20-24 Port Ho~ -13.2 8.47
## # ... with 5,878 more rows, and 14 more variables: infector <chr>, source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>,
## # cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
The groups created reflect each unique combination of values in the grouping columns. To see the groups and the number of rows in each group, pass the grouped data to tally().
See below that there are three unique values in the grouping column outcome: “Death”, “Recover”, and NA. See that there were 2582 deaths, 1983 recoveries, and 1323 with no outcome recorded.
linelist %>%
group_by(outcome) %>%
tally()
## # A tibble: 3 x 2
## outcome n
## * <chr> <int>
## 1 Death 2582
## 2 Recover 1983
## 3 <NA> 1323
You can group by more than one column. Below, the dataframe is grouped by outcome and gender, and then tallied. Note how each unique combination of outcome and gender is registered as its own group - including missing values for either column.
linelist %>%
group_by(outcome, gender) %>%
tally()
## # A tibble: 9 x 3
## # Groups: outcome [3]
## outcome gender n
## <chr> <chr> <int>
## 1 Death f 1236
## 2 Death m 1211
## 3 Death <NA> 135
## 4 Recover f 944
## 5 Recover m 947
## 6 Recover <NA> 92
## 7 <NA> f 631
## 8 <NA> m 633
## 9 <NA> <NA> 59
You can also create a new grouping column within the group_by() statement. This is equivalent to calling mutate() before the group_by(). For a quick tabulation this style can be handy, but for more clarity in your code consider creating this column in its own mutate() step and then piping to group_by().
# group data based on a binary column created *within* the group_by() command
linelist %>%
group_by(
age_class = ifelse(age >= 18, "adult", "child")) %>%
tally(sort = T)
## # A tibble: 3 x 2
## age_class n
## <chr> <int>
## 1 child 3598
## 2 adult 2202
## 3 <NA> 88
By default if you run group_by() on data that are already grouped, the old groups will be removed and the new one(s) will apply. If you want to add new groups to the existing ones, add the argument .add=TRUE.
# Grouped by outcome
by_outcome <- linelist %>%
group_by(outcome)
# Add grouping by gender in addition
by_outcome_gender <- by_outcome %>%
group_by(gender, .add = TRUE)
Data that have been grouped will remain grouped until specifically ungrouped via ungroup(). If you forget to ungroup, it can lead to incorrect calculations! Below is an example of removing all grouping columns:
linelist %>%
group_by(outcome, gender) %>%
tally() %>%
ungroup()
You can also remove grouping by only specific columns, by placing the column name inside.
linelist %>%
group_by(outcome, gender) %>%
tally() %>%
ungroup(gender)
NOTE: The verb count() automatically ungroups the data after counting.
By applying the dplyr verb summarise() to grouped data, you can produce summary tables containing descriptive statistics for each group.
Within the summarise() statement, provide the name(s) of the new summary column(s), an equals sign, and then a statistical function to apply to the data, as shown below. Within a statistical function, list the column to be operated on and any relevant arguments. For example, do not forget na.rm=TRUE to remove missing values from calculations!
Below is an example of summarise() applied without grouped data. The statistics returned are produced from the entire dataset.
linelist %>%
summarise(
mean_age = mean(age_years, na.rm=T),
max_age = max(age_years, na.rm=T),
min_age = min(age_years, na.rm=T))
## mean_age max_age min_age
## 1 16.06242816091954 77 0
In contrast, below is the same summarise() statement applied to grouped data. The statistics are calculated for each outcome group.
linelist %>%
group_by(outcome) %>%
summarise(
mean_age = mean(age_years, na.rm=T),
max_age = max(age_years, na.rm=T),
min_age = min(age_years, na.rm=T))
## # A tibble: 3 x 4
## outcome mean_age max_age min_age
## * <chr> <dbl> <dbl> <dbl>
## 1 Death 16.5 74 0
## 2 Recover 15.7 77 0
## 3 <NA> 15.7 75 0
TIP: Summarise works with both UK and US spelling - summarise() and summarize() call the same function.
across() multiple columns
You can use summarise() across multiple columns using across(). Provide a vector of column names, or use the same helper functions used in select() (listed below) to specify columns by criteria.
Below, mean() is applied to ungrouped data (global calculation). The columns are specified, a function is specified (no parentheses), and finally, any additional arguments for the function (e.g. na.rm=TRUE).
linelist %>%
summarise(across(.cols = c(age_years, temp),
.fns = mean,
na.rm=T))
## age_years temp
## 1 16.06242816091954 38.56498263888889
Below, the same summarise across call is applied on grouped data:
linelist %>%
group_by(outcome) %>%
summarise(across(.cols = c(age_years, temp), .fns = mean, na.rm=T))
## # A tibble: 3 x 3
## outcome age_years temp
## * <chr> <dbl> <dbl>
## 1 Death 16.5 38.5
## 2 Recover 15.7 38.6
## 3 <NA> 15.7 38.6
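In dplyr 1.1.0 and later, supplying extra arguments such as na.rm through across() is deprecated in favour of an anonymous function. Below is a minimal sketch of that style, using a small hypothetical dataframe toy as a stand-in for the linelist.

```r
library(dplyr)

# Hypothetical stand-in for the linelist
toy <- tibble(
  outcome   = c("Death", "Death", "Recover", "Recover"),
  age_years = c(10, NA, 20, 40),
  temp      = c(38, 39, NA, 37)
)

toy_means <- toy %>%
  group_by(outcome) %>%
  summarise(across(.cols = c(age_years, temp),
                   .fns  = ~ mean(.x, na.rm = TRUE)))  # na.rm is set inside the function call
```

The formula shorthand ~ mean(.x, na.rm = TRUE) could also be written as function(x) mean(x, na.rm = TRUE).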
Here are those select() helper functions that you can place within across():
There are helpers available to assist you in specifying columns:
everything() - all other columns not mentioned
last_col() - the last column
where() - applies a function to all columns and selects those for which it returns TRUE
starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
contains() - columns containing a character string. Example: select(contains("time"))
matches() - to apply a regular expression (regex). Example: select(matches("[pt]al"))
num_range() - a numerical range of column names, such as x01, x02, x03
any_of() - matches if column is named. Useful if the name might not exist. Example: select(any_of(c("date_onset", "date_death", "cardiac_arrest")))
For example, to return the mean of every numeric column:
linelist %>%
group_by(outcome) %>%
summarise(across(where(is.numeric), .fns = mean, na.rm=T))
## # A tibble: 3 x 12
## outcome generation age age_years lon lat wt_kg ht_cm ct_blood temp bmi days_onset_hosp
## * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Death 16.7 16.6 16.5 -13.2 8.47 54.0 127. 21.3 38.5 46.8 1.83
## 2 Recover 16.4 15.8 15.7 -13.2 8.47 52.4 124. 21.1 38.6 49.0 2.33
## 3 <NA> 16.5 15.7 15.7 -13.2 8.47 52.3 125. 21.2 38.6 47.6 2.08
If you want multiple summary statistics of multiple columns in an easy-to-read format, consider a two-way table with the gtsummary package. This package is demonstrated more extensively in the Statistics page (LINK).
library(gtsummary)
linelist %>%
select(outcome, age_years, temp, ht_cm) %>% # select columns (optional)
gtsummary::tbl_summary(
by = outcome, # indicate grouping column (optional)
statistic = all_continuous() ~ "{mean} ({sd})") # return mean and std deviation for each group

| Characteristic | Death, N = 2,582¹ | Recover, N = 1,983¹ |
|---|---|---|
| age_years | 17 (13) | 16 (13) |
| Unknown | 48 | 23 |
| temp | 38.55 (1.00) | 38.59 (0.96) |
| Unknown | 51 | 53 |
| ht_cm | 127 (49) | 124 (50) |

¹ Mean (SD)
count() and tally() provide similar functionality but are different.
tally() is shorthand for summarise(n = n()), and does not automatically group data. Thus, to achieve grouped tallies it must follow a group_by() command. You can add sort = TRUE to see the largest groups first.
linelist %>%
tally()
## n
## 1 5888
linelist %>%
group_by(outcome) %>%
tally(sort = TRUE)
## # A tibble: 3 x 2
## outcome n
## <chr> <int>
## 1 Death 2582
## 2 Recover 1983
## 3 <NA> 1323
In contrast, count() does the following:
Applies group_by() on the specified column(s)
Applies summarise() and returns column n with the number of observations per group
Applies ungroup()
linelist %>%
count(outcome)
## outcome n
## 1 Death 2582
## 2 Recover 1983
## 3 <NA> 1323
Just like with group_by() you can create a new column within the count() command:
linelist %>%
count(age_class = ifelse(age >= 18, "adult", "child"), sort = T)
## age_class n
## 1 child 3598
## 2 adult 2202
## 3 <NA> 88
Read more about the distinction between tally() and count() here
Both of these verbs can be called multiple times, with the functionality “rolling up”. For example, to summarise the number of genders present for each outcome, run the following. Note, the name of the final column is changed from default “n” for clarity.
linelist %>%
# produce counts by outcome-gender groups
count(outcome, gender) %>%
# produce counts of gender within each outcome group
count(outcome, name = "number of genders per outcome")
## outcome number of genders per outcome
## 1 Death 3
## 2 Recover 3
## 3 <NA> 3
If you want to add total rows or column after using tally() or count(), consider using the janitor package, which offers functions like adorn_totals() and adorn_percentages(). There are many useful functions (search Help for details), here are a few of them:
adorn_totals() to get totals - specify the argument where = as either “row”, “col”, or c("row", "col").
adorn_percentages() to convert counts to proportions - specify the argument denominator = as either “row”, “col”, or “all”.
adorn_pct_formatting() to convert proportions to percentages (you can specify the number of digits =, whether to add “%” with affix_sign =, and specific column names to operate on)
adorn_ns() to add back the underlying counts (“N”s) to a table whose proportions were calculated by adorn_percentages() - to display them together. Indicate position = of the Ns as either “rear” or “front” of the proportions.
To add totals:
linelist %>%
count(outcome) %>%
adorn_totals(where = "col")
## outcome n Total
## Death 2582 2582
## Recover 1983 1983
## <NA> 1323 1323
To convert the numbers to proportions:
linelist %>%
count(outcome) %>%
adorn_totals(where = "row") %>% # add total row
adorn_percentages(denominator = "col") %>% # convert to proportions
adorn_rounding(digits = 2) # round the proportions
## outcome n
## Death 0.44
## Recover 0.34
## <NA> 0.22
## Total 1.00
janitor functions can be used together, as below:
linelist %>%
count(outcome) %>% # produce the counts by unique outcome
adorn_totals(where = "row") %>% # add total row
adorn_percentages("col") %>% # add proportion by column
adorn_pct_formatting() %>% # proportion converted to percent
adorn_ns(position = "front") # Add the underlying N, in front of the percentage
## outcome n
## Death 2582 (43.9%)
## Recover 1983 (33.7%)
## <NA> 1323 (22.5%)
## Total 5888 (100.0%)
TO DO
Using the dplyr verb arrange() to order the rows in a dataframe behaves the same when the data are grouped, unless you set the argument .by_group = TRUE. In this case the rows are ordered first by the grouping columns and then by any other columns you specify.
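A minimal sketch of this behaviour, using a small hypothetical dataframe toy (not the linelist), compares arrange() on grouped data with and without .by_group = TRUE.

```r
library(dplyr)

# Hypothetical data for demonstration
toy <- tibble(
  hospital = c("Port", "Central", "Port", "Central"),
  age      = c(30, 12, 5, 40)
)

# Default: grouping is ignored, rows are ordered by age across the whole dataframe
ignore_groups <- toy %>%
  group_by(hospital) %>%
  arrange(age)

# .by_group = TRUE: rows are ordered by hospital first, then by age within hospital
within_groups <- toy %>%
  group_by(hospital) %>%
  arrange(age, .by_group = TRUE)
```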
When filter() is applied in conjunction with functions that evaluate the dataframe (like max(), min(), mean()), these functions will now be applied to the groups. For example, if you filter to keep rows where patients are above the median age, the median is now calculated separately for each group.
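The sketch below illustrates filtering on grouped data with a small hypothetical dataframe toy. Each row is compared against the median age of its own hospital rather than the overall median.

```r
library(dplyr)

# Hypothetical data for demonstration
toy <- tibble(
  hospital = c("Port", "Port", "Port", "Central", "Central", "Central"),
  age      = c(10, 20, 60, 1, 2, 9)
)

# median(age) is evaluated separately within each hospital group
above_group_median <- toy %>%
  group_by(hospital) %>%
  filter(age > median(age, na.rm = TRUE)) %>%
  ungroup()
```

Without group_by(), the overall median would be used and a different set of rows would be kept.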
TO DO MORE
The dplyr function slice(), which subsets rows based on their position in the data, can also be applied per group. Remember to account for sorting the data within each group to get the desired “slice”.
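For example, to retrieve only the first 5 admissions from each hospital:
Group the linelist by column hospital
Arrange the records by date_hospitalisation within each hospital group
Slice to retrieve the first 5 rows from each hospital group
linelist %>%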
group_by(hospital) %>%
arrange(hospital, date_hospitalisation) %>%
slice_head(n = 5) %>%
arrange(hospital) %>%
select(case_id, hospital, date_hospitalisation)
## # A tibble: 30 x 3
## # Groups: hospital [6]
## case_id hospital date_hospitalisation
## <chr> <chr> <date>
## 1 20b688 Central Hospital 2014-05-06
## 2 d58402 Central Hospital 2014-05-10
## 3 b8f2fd Central Hospital 2014-05-13
## 4 acf422 Central Hospital 2014-05-28
## 5 275cc7 Central Hospital 2014-05-28
## 6 d1fafd Military Hospital 2014-04-17
## 7 974bc1 Military Hospital 2014-05-13
## 8 6a9004 Military Hospital 2014-05-13
## 9 09e386 Military Hospital 2014-05-14
## 10 865581 Military Hospital 2014-05-15
## # ... with 20 more rows
slice_head() - selects n rows from the top
slice_tail() - selects n rows from the end
slice_sample() - randomly selects n rows
slice_min() - selects n rows with lowest values in order_by = column, use with_ties = TRUE to keep ties
slice_max() - selects n rows with highest values in order_by = column, use with_ties = TRUE to keep ties
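As a sketch of slicing by value rather than by position, the example below uses slice_max() to keep the most recent admissions per hospital. The dataframe toy is hypothetical; the column names mirror the linelist only for readability.

```r
library(dplyr)

# Hypothetical data for demonstration
toy <- tibble(
  case_id  = c("a", "b", "c", "d", "e", "f"),
  hospital = c("Port", "Port", "Port", "Central", "Central", "Central"),
  date_hospitalisation = as.Date(c("2014-05-01", "2014-05-09", "2014-05-04",
                                   "2014-06-02", "2014-06-01", "2014-06-07"))
)

# keep the 2 latest admissions within each hospital; no prior arrange() is needed
latest_two <- toy %>%
  group_by(hospital) %>%
  slice_max(order_by = date_hospitalisation, n = 2, with_ties = FALSE) %>%
  ungroup()
```

Setting with_ties = FALSE guarantees at most n rows per group even when dates are duplicated.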
The function add_count() adds a column n to the original data giving the number of rows in that row’s group.
Shown below for simplicity is a selection of the linelist data - add_count() is applied to hospital, so the values in column n reflect the number of rows in that row’s hospital group. Note how values are repeated. The default column name n can be changed with the argument name =.
linelist %>%
select(case_id, hospital) %>%
add_count(hospital) %>% # add "number of rows admitted to same hospital as this row"
head(10) # show just the first 10 rows, for demo purposes
## case_id hospital n
## 1 d8a13d St. Mark's Maternity Hospital (SMMH) 422
## 2 8689b7 Missing 1469
## 3 11f8ea St. Mark's Maternity Hospital (SMMH) 422
## 4 dae8c7 Port Hospital 1762
## 5 acf422 Central Hospital 454
## 6 1a4ac9 Port Hospital 1762
## 7 275cc7 Central Hospital 454
## 8 1389ca Missing 1469
## 9 057e7a Missing 1469
## 10 c97dd9 Port Hospital 1762
It then becomes easy to filter for case rows who were hospitalized at a “small” hospital, say, a hospital that admitted fewer than 500 patients:
linelist %>%
select(case_id, hospital) %>%
add_count(hospital) %>%
filter(n < 500)
To retain all columns and rows (not summarise) and add a new column containing group statistics, use mutate() instead of summarise().
This is useful if you want group statistics in the original dataset with all other columns present - e.g. for calculations comparing one row to its group.
For example, the code below calculates the difference between a row’s delay-to-admission and the mean delay for their hospital. The steps are:
Group the data by hospital
Use mutate() with mean() on days_onset_hosp (delay to hospitalisation) to create a new column containing the mean delay at the hospital of that row
Calculate the difference between the row's delay and the mean delay at their hospital
linelist %>%
# group data by hospital (no change to linelist yet)
group_by(hospital) %>%
# new columns
mutate(
# mean days to admission per hospital (rounded to 1 decimal)
group_delay_admit = round(mean(days_onset_hosp, na.rm=T), 1),
# difference between row's delay and mean delay at their hospital (rounded to 1 decimal)
diff_to_group = round(days_onset_hosp - group_delay_admit, 1)) %>%
# select certain columns only - for demonstration/viewing purposes
select(case_id, hospital, days_onset_hosp, group_delay_admit, diff_to_group)
## # A tibble: 5,888 x 5
## # Groups: hospital [6]
## case_id hospital days_onset_hosp group_delay_admit diff_to_group
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 d8a13d St. Mark's Maternity Hospital (SMMH) 2 2.1 -0.1
## 2 8689b7 Missing 1 2.1 -1.1
## 3 11f8ea St. Mark's Maternity Hospital (SMMH) 2 2.1 -0.1
## 4 dae8c7 Port Hospital NA 2.1 NA
## 5 acf422 Central Hospital 1 1.9 -0.9
## 6 1a4ac9 Port Hospital 2 2.1 -0.1
## 7 275cc7 Central Hospital 1 1.9 -0.9
## 8 1389ca Missing 2 2.1 -0.1
## 9 057e7a Missing 1 2.1 -1.1
## 10 c97dd9 Port Hospital NA 2.1 NA
## # ... with 5,878 more rows
The verb select() works on grouped data, but the grouping columns are always included (even if not mentioned in select()).
If you do not want these grouping columns, use ungroup() first.
Here we briefly demonstrate grouping data with the base R function aggregate()
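As a minimal sketch only, the example below uses the formula interface of aggregate() on a small hypothetical dataframe toy to calculate the mean age per outcome.

```r
# Hypothetical data for demonstration (base R only, no packages needed)
toy <- data.frame(
  outcome   = c("Death", "Death", "Recover", "Recover"),
  age_years = c(10, NA, 20, 40)
)

# Formula interface: summarise age_years within each value of outcome.
# Rows with missing values are dropped by default (na.action = na.omit).
mean_age_by_outcome <- aggregate(age_years ~ outcome, data = toy, FUN = mean)
```

Unlike group_by(), aggregate() omits rows where the grouping column itself is missing, so there is no separate NA group in the result.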
TO DO
Here are some useful resources for more information:
https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf
https://datacarpentry.org/R-genomics/04-dplyr.html
https://dplyr.tidyverse.org/reference/group_by.html
https://dplyr.tidyverse.org/articles/grouping.html
https://itsalocke.com/files/DataManipulationinR.pdf
You can perform any summary function on grouped data; see the Cheat Sheet here for more info: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf
This page describes common “joins” and also probabilistic matching between dataframes.
Load packages
pacman::p_load(
rio, # import/export
here, # relative filepaths
tidyverse, # data management/viz
RecordLinkage, # probabilistic matches
fastLink # probabilistic matches
)
Because traditional joins (non-probabilistic) can be very specific, requiring exact string matches, you may need to clean the datasets prior to the join (e.g. change spellings, change case to all lower or upper).
Load data
linelist <- rio::import("linelist_cleaned.csv")
In the joining examples, we’ll use the following datasets:
A “miniature” version of linelist, containing only the columns case_id, date_onset, and hospital, and only the first 10 rows
A separate dataframe hosp_info, which contains more details about each hospital
“miniature” linelist
Below is the miniature linelist used for demonstration purposes:
linelist_mini <- linelist %>% # start with original linelist
select(case_id, date_onset, hospital) %>% # select columns
head(10) # keep only the first 10 rows
Hospital Information dataframe
Below is the separate dataframe with additional information about each hospital.
Because traditional (non-probabilistic) joins are case-sensitive and require exact string matches, we will clean-up the hosp_info dataset prior to the joins.
Identify differences
We need the values of hosp_name column in hosp_info dataframe to match the values of hospital column in the linelist dataframe.
Here are the values in linelist_mini:
unique(linelist_mini$hospital)
## [1] "St. Mark's Maternity Hospital (SMMH)" "Missing" "Port Hospital"
## [4] "Central Hospital"
and here are the values in hosp_info:
unique(hosp_info$hosp_name)
## [1] "central hospital" "military" "port" "St. Mark's" "ignace" "sisters"
Align matching values
We begin by cleaning the values in hosp_name. We use logic to code the values in the new column using case_when() (LINK). We correct the hospital names that exist in both dataframes, and leave the others as they are (see TRUE ~ hosp_name).
CAUTION: Typically, one should create a new column (e.g. hosp_name_clean), but for ease of demonstration we show modification of the old column
hosp_info <- hosp_info %>%
mutate(
hosp_name = case_when(
hosp_name == "military" ~ "Military Hospital",
hosp_name == "port" ~ "Port Hospital",
hosp_name == "St. Mark's" ~ "St. Mark's Maternity Hospital (SMMH)",
hosp_name == "central hospital" ~ "Central Hospital",
TRUE ~ hosp_name
)
)
We now see that the hospital names that appear in both dataframes are aligned. There are some hospitals in hosp_info that are not present in linelist - we will deal with these later, in the join.
unique(hosp_info$hosp_name)
## [1] "Central Hospital" "Military Hospital" "Port Hospital"
## [4] "St. Mark's Maternity Hospital (SMMH)" "ignace" "sisters"
If you need to convert all values to UPPER or lower case, use these functions from stringr, as shown in the page on characters/strings (LINK).
str_to_upper()
str_to_lower()
str_to_title()
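A short sketch of these functions applied to a hypothetical vector of hospital names (hosp_names is made up for illustration):

```r
library(stringr)

# Hypothetical hospital names with inconsistent case
hosp_names <- c("central hospital", "PORT HOSPITAL", "Military Hospital")

lower_names <- str_to_lower(hosp_names)  # all lower case
upper_names <- str_to_upper(hosp_names)  # all upper case
title_names <- str_to_title(hosp_names)  # first letter of each word capitalised
```

Converting the join columns of both dataframes to the same case before joining avoids missed matches that differ only by capitalisation.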
dplyr offers several different joins. Below they are described, with some simple use cases. Many thanks to https://github.com/gadenbuie for the moving images!
General function structure
Any of these join commands can be run independently, like below.
An object is being created, or re-defined: dataframe 2 is being joined to dataframe 1, on the basis of matches between the “ID” column in df1 and “identifier” column in df2. Because this example uses left_join(), any rows in df2 that do not match to df1 will be dropped.
object <- left_join(df1, df2, by = c("ID" = "identifier"))
The join commands can also be run within a pipe chain. The first dataframe df1 is understood to be the dataframe that is being passed through the pipes. An example is shown below, in context with some additional non-important mutate() and filter() commands before the join.
object <- df1 %>%
filter(var1 == 2) %>% # for demonstration only
mutate(lag = day + 7) %>% # for demonstration only
left_join(df2, by = c("ID" = "identifier")) # join df2 to df1
Join columns (by =)
You must specify the columns in each dataset in which the values must match, using the argument by =. You have a few options:
One column name (by = "ID") - this only works if this exact column name is present in both dataframes!
Different column names (by = c("ID" = "Identifier")) - use this if the column names are different in the 2 dataframes
Multiple columns (by = c("ID" = "Identifier", "date_onset" = "Date_of_Onset")) - this will require exact matches on multiple columns for rows to join.
CAUTION: Joins are case-sensitive! Therefore it is useful to convert all values to lowercase or uppercase prior to joining. See the page on characters/strings.
A left or right join is commonly used to add information to a dataframe - new information is added only to rows that already exist in the baseline dataframe.
These are common joins in epidemiological work - they are used to add information from one dataset into another.
The order of the dataframes is important.
All rows of the baseline dataframe are kept. Information in the secondary dataframe is joined to the baseline dataframe only if there is a match via the identifier column(s). In addition:
* Rows in the secondary dataframe that do not match are dropped.
* If there are many baseline rows that match to one row in the secondary dataframe (many-to-one), the secondary information is added to each matching baseline row.
* If a baseline row matches to multiple rows in the secondary dataframe (one-to-many), all combinations are given, meaning new rows may be added to your returned dataframe!
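The sketch below illustrates several of these behaviours with two small hypothetical dataframes, cases and hospitals (made up for illustration, not the handbook data): a non-matching secondary row is dropped, a non-matching baseline row receives NA, and a one-to-many match adds a row.

```r
library(dplyr)

# Hypothetical baseline dataframe
cases <- tibble(
  case_id  = c("c1", "c2", "c3"),
  hospital = c("Port", "Central", "Military")
)

# Hypothetical secondary dataframe: "Military" appears twice, "Ignace" has no cases
hospitals <- tibble(
  hosp_name = c("Port", "Military", "Military", "Ignace"),
  beds      = c(100, 40, 45, 20)
)

joined <- cases %>%
  left_join(hospitals, by = c("hospital" = "hosp_name"))

# c1 gains its single match; c2 ("Central") has no match, so beds is NA;
# c3 matches two "Military" rows and so appears twice; "Ignace" is dropped
```

Comparing nrow(cases) with nrow(joined) after a join is a quick way to detect unexpected one-to-many matches.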
Example
Below is the output of a left_join() of hosp_info (secondary dataframe) into linelist_mini (baseline dataframe). Note the following:
All original rows of the baseline dataframe linelist_mini are kept
One original row of linelist_mini is duplicated (“Military Hospital”) because it matched to two rows in the secondary dataframe, so both combinations are returned
The join identifier column of the secondary dataset (hosp_name) has disappeared because it is redundant with the identifier column in the primary dataset (hospital)
When a baseline row did not match to any secondary row (e.g. when hospital is “Other” or “Missing”), NA fills in the columns from the secondary dataframe
linelist_mini %>%
left_join(hosp_info, by = c("hospital" = "hosp_name"))
“Should I use a right join, or a left join?”
Most important is to ask “which dataframe should retain all of its rows?” - use this one as the baseline.
The two commands below achieve the same output - 10 rows of hosp_info joined into a linelist_mini baseline. However, the column order will differ based on whether hosp_info arrives from the right (in the left join) or arrives from the left (in the right join). The order of the rows may also shift consequently.
Also consider whether your use-case is within a pipe chain (%>%). If the dataset in the pipes is the baseline, you will likely use a left join to add data to it.
# The two commands below achieve the same data, but with differently ordered rows and columns
left_join(linelist_mini, hosp_info, by = c("hospital" = "hosp_name"))
right_join(hosp_info, linelist_mini, by = c("hosp_name" = "hospital"))
A full join is the most inclusive of the joins - it returns all rows from both dataframes.
If there are any rows present in one and not the other (where no match was found), the dataframe will grow as NA values are added to fill in. Watch the number of columns and rows carefully and troubleshoot case-sensitivity and exact string matches.
Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.
Example
Below is the output of a full_join() of hosp_info into linelist_mini. Note the following:
All rows of the baseline dataframe (linelist_mini) are kept
Only the identifier column from the baseline is kept (hospital)
NA fills in where baseline rows did not match to secondary rows (hospital was “Other” or “Missing”), or the opposite (where hosp_name was “ignace” or “sisters”)
linelist_mini %>%
full_join(hosp_info, by = c("hospital" = "hosp_name"))
An inner join is the most restrictive of the joins - it returns only rows with matches across both dataframes.
This means that your original dataset may have fewer rows after the join. Adjustment of the “baseline” (first) dataframe will not impact which records are returned, but it will impact the column order, row order, and which identifier column is retained.
Example
Below is the output of an inner_join() of linelist_mini (baseline) with hosp_info (secondary). Note the following:
Baseline rows where hospital is “Missing” or “Other” are removed because they had no match in the secondary dataframe
Likewise, secondary rows where hosp_name is “sisters” or “ignace” are removed as they have no match in the baseline dataframe
Only the identifier column from the baseline is kept (hospital)
linelist_mini %>%
inner_join(hosp_info, by = c("hospital" = "hosp_name"))
hosp_info %>%
inner_join(linelist_mini, by = c("hosp_name" = "hospital"))
A semi join is a “filtering join” which uses another dataset not to add rows or columns, but to perform filtering.
A semi-join keeps all observations in dataframe 1 that have a match in dataframe 2 (but does not add new columns or duplicate any rows with multiple matches). Read more about filtering joins here.
The code below returns only the rows of hosp_info whose hosp_name has a match in the hospital column of linelist_mini. No columns from linelist_mini are added, and no rows are duplicated.
hosp_info %>%
semi_join(linelist_mini, by = c("hosp_name" = "hospital"))
The anti join is a “filtering join” that returns rows in dataframe 1 that do not have a match in dataframe 2.
Read more about filtering joins here.
Common scenarios for an anti-join include identifying records not present in another dataframe, troubleshooting spelling in a join (catching records that should have matched), and examining records that were excluded after another join.
As with right_join() and left_join(), the baseline dataframe (listed first) is important. The returned rows are from it only. Notice in the gif below that the row in the non-baseline dataframe (purple 4) is not returned even though it does not match.
Simple example
For an example, let’s find the hosp_info hospitals that do not have any cases present in linelist_mini. We list hosp_info first, as the baseline dataframe. The two hospitals which are not present in linelist_mini are returned.
hosp_info %>%
  anti_join(linelist_mini, by = c("hosp_name" = "hospital"))
Example 2
For another example, let us say we ran an inner_join() between linelist_mini and hosp_info. This returns only 8 of the original 11 linelist_mini records.
linelist_mini %>%
  inner_join(hosp_info, by = c("hospital" = "hosp_name"))
To review the 3 linelist_mini records that were excluded in the inner join, we can run an anti-join with linelist_mini as the baseline dataframe.
linelist_mini %>%
  anti_join(hosp_info, by = c("hospital" = "hosp_name"))
To see the hosp_info records that were excluded in the inner join, we could also run an anti-join with hosp_info as the baseline dataframe.
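The two directions of the anti-join can be sketched on toy data. The dataframes linelist_toy and hosp_toy below are hypothetical stand-ins, not the handbook's linelist_mini and hosp_info, and their values are invented for illustration only.

```r
library(dplyr)

# Hypothetical stand-ins for linelist_mini and hosp_info (invented values)
linelist_toy <- tibble(
  case_id  = c("c1", "c2", "c3", "c4"),
  hospital = c("Central", "Port", "Other", "Missing")
)
hosp_toy <- tibble(
  hosp_name     = c("Central", "Port", "sisters", "ignace"),
  catchment_pop = c(1000, 2000, 3000, 4000)
)

# Rows kept by the inner join (a match exists on both sides)
matched <- inner_join(linelist_toy, hosp_toy, by = c("hospital" = "hosp_name"))

# Case rows excluded by the inner join (linelist is the baseline)
cases_dropped <- anti_join(linelist_toy, hosp_toy, by = c("hospital" = "hosp_name"))

# Hospital rows excluded by the inner join (hospital table is the baseline)
hosp_dropped <- anti_join(hosp_toy, linelist_toy, by = c("hosp_name" = "hospital"))
```

In this sketch each case matches at most one hospital, so the rows in matched and cases_dropped together account for every row of the baseline linelist. This makes the pair a convenient check after any restrictive join.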
Under construction - TBD
If you do not have a unique identifier common across datasets to join on, consider using a probabilistic matching algorithm. This finds matches between records based on similarity (e.g. Jaro–Winkler string distance, or numeric distance). Below is a simple example using the package fastLink.
Load packages
pacman::p_load(
  tidyverse,      # data manipulation and visualization
  fastLink        # record matching
  )
Here are two small example datasets that we will use to demonstrate the probabilistic matching:
The cases dataset has 9 records of patients who are awaiting test results.
The results dataset has 14 records and contains the column result, which we want to add to the records in cases based on probabilistic matching of records.
The fastLink() function from the fastLink package can be used to apply a matching algorithm. Here is the basic information. You can read more detail by entering ?fastLink in your console.
- Provide the two dataframes to compare to dfA = and dfB =
- In varnames = give all column names to be used for matching. They must all exist in both dfA and dfB.
- In stringdist.match = give columns from those in varnames to be evaluated on string “distance”.
- In numeric.match = give columns from those in varnames to be evaluated on numeric distance.
- To see all evaluated matches rather than a deduplicated set, use dedupe.matches = FALSE. The deduplication is done using Winkler’s linear assignment solution.
Tip: split one date column into three separate numeric columns using day(), month(), and year() from the lubridate package.
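The tip about splitting a date into numeric parts could look like the sketch below. The dataframe cases_toy and its dob column are hypothetical. The new column names yr, mon, and day are chosen to mirror those used in the fastLink() call in this section.

```r
library(dplyr)
library(lubridate)

# Hypothetical records with a single date-of-birth column
cases_toy <- tibble(dob = as.Date(c("1970-09-19", "1985-01-02")))

# Split the date into three numeric columns for use in numeric.match =
cases_toy <- cases_toy %>%
  mutate(
    yr  = year(dob),    # e.g. 1970
    mon = month(dob),   # e.g. 9
    day = day(dob)      # e.g. 19
  )
```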
The default threshold for matches is 0.85 (threshold.match =), but you can adjust it higher or lower. Keep in mind that a higher threshold can yield more false negatives (rows that should match but do not), while a lower threshold can yield more false-positive matches.
Below, the data are matched on string distance across the name and district columns, and on numeric distance for year, month, and day of birth. A match threshold of 95% probability is set.
fl_output <- fastLink::fastLink(
  dfA = cases,
  dfB = results,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day"),
  threshold.match = 0.95)
##
## ====================
## fastLink(): Fast Probabilistic Record Linkage
## ====================
##
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## Calculating matches for each variable.
## Getting counts for parameter estimation.
## Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
## Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Deduping the estimated matches.
## Getting the match patterns for each estimated match.
Review matches
We defined the object returned from fastLink() as fl_output. It is of class list, and it actually contains several dataframes within it, detailing the results of the matching. One of these dataframes is matches, which contains the most likely matches across cases and results. You can access this “matches” dataframe with fl_output$matches. Below, it is saved as my_matches for ease of accessing later.
When my_matches is printed, you see two column vectors: the pairs of row numbers/indices (also called “rownames”) in cases (“inds.a”) and in results (“inds.b”) representing the best matches. If a row number from a dataframe is missing, then no match was found in the other at the specified match threshold.
# save and print matches
my_matches <- fl_output$matches
my_matches
##   inds.a inds.b
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 8 8
## 6 7 9
## 7 6 10
## 8 5 12
Things to note:
- One row in cases (for “Blessing Adebayo”, row 9) had no good match in results, so it is not present in my_matches.
Join based on the probabilistic matches
To use these matches to join results to cases, one strategy is:
- Use left_join() to join my_matches to cases (matching rownames in cases to “inds.a” in my_matches)
- Then use another left_join() to join results to cases (matching the newly-acquired “inds.b” in cases to rownames in results)
Before the joins, we should clean the three datasets:
- Both dfA and dfB should have their row numbers (“rowname”) converted to a proper column
- Both columns in my_matches are converted to class character, so they can be joined to the character rownames
# Clean data prior to joining
#############################
# convert cases rownames to a column
cases_clean <- cases %>% rownames_to_column()
# convert results rownames to a column
results_clean <- results %>% rownames_to_column()
# convert all columns in matches dataset to character, so they can be joined to the rownames
matches_clean <- my_matches %>%
mutate(across(everything(), as.character))
# Join matches to dfA, then add dfB
###################################
# column "inds.b" is added to dfA
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
# column(s) from dfB are added
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
As performed using the code above, the resulting dataframe complete will contain all columns from both cases and results. Many will be appended with suffixes “.x” and “.y”, because the column names would otherwise be duplicated.
Alternatively, to achieve only the “original” 9 records in cases with the new column(s) from results, use select() on results before the joins, so that it contains only rownames and the columns that you want to add to cases (e.g. the column result).
cases_clean <- cases %>% rownames_to_column()
results_clean <- results %>%
rownames_to_column() %>%
select(rowname, result) # select only certain columns
matches_clean <- my_matches %>%
mutate(across(everything(), as.character))
# joins
complete <- left_join(cases_clean, matches_clean, by = c("rowname" = "inds.a"))
complete <- left_join(complete, results_clean, by = c("inds.b" = "rowname"))
If you want to subset either dataset to only the rows that matched, you can use the code below:
cases_matched <- cases[my_matches$inds.a,]  # Rows in cases that matched to a row in results
results_matched <- results[my_matches$inds.b,]  # Rows in results that matched to a row in cases
Or, to see only the rows that did not match:
cases_not_matched <- cases[!rownames(cases) %in% my_matches$inds.a,]  # Rows in cases that did NOT match to a row in results
results_not_matched <- results[!rownames(results) %in% my_matches$inds.b,]  # Rows in results that did NOT match to a row in cases
Probabilistic matching can be used to deduplicate a dataset as well. See the page on deduplication for other methods of deduplication.
Here we began with the cases dataset, but are now calling it cases_dup, as it has 2 additional rows that could be duplicates of previous rows:
See “Tony” with “Anthony”, and “Marialisa Rodrigues” with “Maria Rodriguez”.
Run the same fastLink() command as before, but compare the cases_dup dataframe to itself. When the two dataframes provided are identical, the function assumes you want to de-duplicate.
## Run fastLink on the same dataset
dedupe_output <- fastLink(
  dfA = cases_dup,
  dfB = cases_dup,
  varnames = c("gender", "first", "middle", "last", "yr", "mon", "day", "district"),
  stringdist.match = c("first", "middle", "last", "district"),
  numeric.match = c("yr", "mon", "day")
)
##
## ====================
## fastLink(): Fast Probabilistic Record Linkage
## ====================
##
## If you set return.all to FALSE, you will not be able to calculate a confusion table as a summary statistic.
## dfA and dfB are identical, assuming deduplication of a single data set.
## Setting return.all to FALSE.
##
## Calculating matches for each variable.
## Getting counts for parameter estimation.
## Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Running the EM algorithm.
## Getting the indices of estimated matches.
## Parallelizing calculation using OpenMP. 1 threads out of 4 are used.
## Calculating the posterior for each pair of matched observations.
## Getting the match patterns for each estimated match.
Now, you can review the potential duplicates with getMatches(). Provide the dataframe as both dfA = and dfB =, and provide the output of the fastLink() function as fl.out =. Note that fl.out must be of class fastLink.dedupe, in other words, the result of fastLink() run as above.
## Run getMatches()
cases_dedupe <- getMatches(
  dfA = cases_dup,
  dfB = cases_dup,
  fl.out = dedupe_output)
See the right-most column, which indicates the duplicate IDs - the final two rows are identified as being likely duplicates of rows 2 and 3.
To return the row numbers of rows which are likely duplicates, you can count the number of rows per unique value in the dedupe.ids column, and then filter to keep only those with more than one row. In this case this leaves rows 2 and 3.
cases_dedupe %>%
  count(dedupe.ids) %>%
  filter(n > 1)
##   dedupe.ids n
## 1          2 2
## 2          3 2
To inspect the whole rows of the likely duplicates, put the row number in this command:
# displays row 2 and all likely duplicates of it
cases_dedupe[cases_dedupe$dedupe.ids == 2,]
##    gender   first middle  last   yr mon day district dedupe.ids
## 2       M Anthony     B. Smith 1970   9  19    River          2
## 10      M    Tony     B. Smith 1970   9  19    River          2
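To inspect all likely duplicates at once rather than one ID at a time, a grouped filter is one option. The dataframe cases_dedupe_toy below is a hand-made stand-in that only mimics the shape of the getMatches() output (a dedupe.ids column). It is not produced by fastLink, and its values are assumptions based on the names mentioned in this section.

```r
library(dplyr)

# Hand-made stand-in for the getMatches() output (values are illustrative)
cases_dedupe_toy <- tibble(
  first      = c("Blessing", "Anthony", "Marialisa", "Tony", "Maria"),
  last       = c("Adebayo", "Smith", "Rodrigues", "Smith", "Rodriguez"),
  dedupe.ids = c(1, 2, 3, 2, 3)
)

# Keep every row whose dedupe.ids value is shared with at least one other row
likely_dups <- cases_dedupe_toy %>%
  group_by(dedupe.ids) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  arrange(dedupe.ids)
```

Sorting by dedupe.ids places each set of likely duplicates on adjacent rows, which makes manual review easier.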
See this vignette on fastLink at the package’s Github page
Publication describing methodology of fastLink
Publication describing RecordLinkage package
This page demonstrates use of the stringr package to evaluate and manage character values (strings).
- str_length(), str_sub(), word()
- str_c(), str_glue(), str_order()
- str_sub(), str_replace_all()
- str_pad(), str_trunc(), str_wrap()
- str_to_upper(), str_to_title(), str_to_lower(), str_to_sentence()
- str_detect(), str_subset(), str_match()
For ease of display, most examples are shown acting on a short defined character vector; however, they can easily be applied or adapted to a column within a dataset.
Much of this page is adapted from this online vignette
Install or load the stringr package.
# install or load the stringr package
pacman::p_load(stringr,    # many functions for handling strings
               tidyverse,  # for optional data manipulation
               tools       # alternative for converting to title case
               )
A reference sheet for stringr functions can be found here
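Use str_sub() to return only a part of a string. The function takes three main arguments:
- the character vector(s)
- the start position
- the end position
A few notes on position numbers:
- A positive position number is counted from the left end of the string.
- A negative position number is counted from the right end of the string.
- Position numbers are inclusive.
- Positions extending beyond the string are truncated.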
Below are some examples applied to the string “pneumonia”:
# start and end third from left (3rd letter from left)
str_sub("pneumonia", 3, 3)
## [1] "e"
# 0 is not present
str_sub("pneumonia", 0, 0)
## [1] ""
# 6th from left, to the 1st from right
str_sub("pneumonia", 6, -1)
## [1] "onia"
# 5th from right, to the 2nd from right
str_sub("pneumonia", -5, -2)
## [1] "moni"
# 4th from left to a position outside the string
str_sub("pneumonia", 4, 15)
## [1] "umonia"
To extract the nth ‘word’, use word(), also from stringr. Provide the string(s), then the first word position to extract, and the last word position to extract.
By default, the separator between ‘words’ is assumed to be a space, unless otherwise indicated with sep = (e.g. sep = "_" when words are separated by underscores).
# strings to evaluate
chief_complaints <- c("I just got out of the hospital 2 days ago, but still can barely breathe.",
                      "My stomach hurts",
                      "Severe ear pain")
# extract 1st to 3rd words of each string
word(chief_complaints, start = 1, end = 3, sep = " ")
## [1] "I just got"       "My stomach hurts" "Severe ear pain"
str_sub() paired with the assignment operator (<-) can be used to modify a part of a string:
word <- "pneumonia"
# convert the third and fourth characters to X
str_sub(word, 3, 4) <- "XX"
word
## [1] "pnXXmonia"
An example applied to multiple strings (e.g. a column). Note the expansion in length of “HIV”.
words <- c("pneumonia", "tubercolosis", "HIV")
# convert the third and fourth characters to X
str_sub(words, 3, 4) <- "XX"
words
## [1] "pnXXmonia"    "tuXXrcolosis" "HIXX"
To return the number of characters in a string, use str_length():
str_length("abc")
## [1] 3
Alternatively, use nchar() from base R.
This section covers:
- Using str_c(), str_glue(), and unite() to combine strings
- Using str_order() to arrange strings
- Using str_split() and separate() to split strings
To combine or concatenate multiple strings into one string, we suggest using str_c() from stringr.
str_c("String1", "String2", "String3")
## [1] "String1String2String3"
The argument sep = inserts characters between each input vector (e.g. a comma or newline "\n").
str_c("String1", "String2", "String3", sep = ", ")
## [1] "String1, String2, String3"
The argument collapse = is relevant if producing multiple elements. The example below shows the combination of two vectors into one (first names and last names). Another similar example might be jurisdictions and their case counts.
In this example:
- The sep value goes between each first and last name
- The collapse value goes between each person
first_names <- c("abdul", "fahruk", "janice")
last_names <- c("hussein", "akinleye", "musa")
# sep is between the respective strings, while collapse is between the elements produced
str_c(first_names, last_names, sep = " ", collapse = "; ")
## [1] "abdul hussein; fahruk akinleye; janice musa"
When printing such a combined string with newlines, you may need to wrap the whole phrase in cat() for the newlines to print properly:
# For newlines to print correctly, the phrase may need to be wrapped in cat()
cat(str_c(first_names, last_names, sep = " ", collapse = ";\n"))
## abdul hussein;
## fahruk akinleye;
## janice musa
Use str_glue() to insert dynamic R code into a string. This is a very useful function for creating dynamic plot captions, as demonstrated below.
- All content goes within quotation marks inside the parentheses: str_glue("")
- Any dynamic code goes within curly brackets {} inside the quotation marks. There can be many curly brackets.
- Use \n within the quotes to force a new line
- Use format() to adjust date display, and use Sys.Date() to display the current date
A simple example of a dynamic plot caption:
str_glue("The linelist is current to {format(Sys.Date(), '%d %b %Y')} and includes {nrow(linelist)} cases.")
## The linelist is current to 21 Feb 2021 and includes 5888 cases.
An alternative format is to use placeholders within the brackets and define the code in separate arguments at the end of the str_glue() function, as below. This can improve readability if the code is long.
str_glue("Data source is the confirmed case linelist as of {current_date}.\nThe last case was reported hospitalized on {last_hospital}.\n{n_missing_onset} cases are missing date of onset and not shown",
         current_date = format(Sys.Date(), '%d %b %Y'),
         last_hospital = format(as.Date(max(linelist$date_hospitalisation, na.rm=T)), '%d %b %Y'),
         n_missing_onset = nrow(linelist %>% filter(is.na(date_onset)))
         )
## Data source is the confirmed case linelist as of 21 Feb 2021.
## The last case was reported hospitalized on 30 Apr 2015.
## 241 cases are missing date of onset and not shown
Pulling from a dataframe
Sometimes, it is useful to pull data from dataframe and have it pasted together in sequence. Below is an example using this dataset to make a summary output of jurisdictions and the new and total cases:
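The case_table object is not defined in this section. A minimal stand-in can be built as sketched below, assuming the column names zone, new_cases, and total_cases and the values that appear in the printed outputs of this section. The actual handbook dataset may contain additional columns.

```r
library(dplyr)
library(stringr)

# Stand-in for case_table, reconstructed from the outputs printed in this section
case_table <- tibble(
  zone        = str_c("Zone ", 1:5),
  new_cases   = c(3, 0, 7, 0, 15),
  total_cases = c(40, 4, 25, 10, 103)
)
```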
DT::datatable(case_table, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
Option 1:
Use str_c() with the dataframe and column names. Provide sep and collapse arguments.
str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = "; ")
## [1] "Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
Add text “New Cases:” to the beginning of the summary by wrapping with a separate str_c() (if “New Cases:” was within the original str_c() it would appear multiple times).
str_c("New Cases: ", str_c(case_table$zone, case_table$new_cases, sep = " = ", collapse = "; "))
## [1] "New Cases: Zone 1 = 3; Zone 2 = 0; Zone 3 = 7; Zone 4 = 0; Zone 5 = 15"
Option 2:
You can achieve a similar result with str_glue(), with newlines added automatically:
str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)")
## Zone 1: 3 new cases (40 total cases)
## Zone 2: 0 new cases (4 total cases)
## Zone 3: 7 new cases (25 total cases)
## Zone 4: 0 new cases (10 total cases)
## Zone 5: 15 new cases (103 total cases)
To use str_glue() but have more control (e.g. to use double newlines), wrap it within str_c() and adjust the collapse value. You may need to print using cat() to correctly print the newlines.
case_summary <- str_c(str_glue("{case_table$zone}: {case_table$new_cases} new cases ({case_table$total_cases} total cases)"), collapse = "\n\n")
cat(case_summary) # print
## Zone 1: 3 new cases (40 total cases)
##
## Zone 2: 0 new cases (4 total cases)
##
## Zone 3: 7 new cases (25 total cases)
##
## Zone 4: 0 new cases (10 total cases)
##
## Zone 5: 15 new cases (103 total cases)
Within a dataframe, bringing together character values from multiple columns can be achieved with unite() from tidyr. This is the opposite of separate().
Provide the name of the new united column. Then provide the names of the columns you wish to unite.
- By default, the separator used in the united column is an underscore _, but this can be changed with the sep argument.
- remove = removes the input columns from the data frame (TRUE by default)
- na.rm = removes missing values while uniting (FALSE by default)
Below, we unite the three symptom columns in this dataframe.
df_split %>%
  unite(
    col = "all_symptoms",          # name of the new united column
    c("sym_1", "sym_2", "sym_3"),  # columns to unite
    sep = ", ",                    # separator to use in united column
    remove = TRUE,                 # if TRUE, removes input cols from the data frame
    na.rm = TRUE                   # if TRUE, missing values are removed before uniting
  )
##   case_ID                all_symptoms outcome
## 1       1     jaundice, fever, chills Success
## 2       2        chills, aches, pains Failure
## 3       3                       fever Failure
## 4       4         vomiting, diarrhoea Success
## 5       5 bleeding, from, gums, fever Success
## 6       6      rapid, pulse, headache Success
To split a string based on a pattern, use str_split(). It evaluates the strings and returns a list of character vectors consisting of the newly-split values.
The simple example below evaluates one string and splits it into three. By default it returns a list with one element (a character vector) for each string provided. If simplify = TRUE it returns a character matrix.
Here one string is provided, and the result is a list with one element: a character vector with three values.
str_split("jaundice, fever, chills", ",")
## [[1]]
## [1] "jaundice" " fever" " chills"
You can assign this as a named object, and access the nth symptom. To access a specific symptom you can use syntax like this: the_split_return_object[[1]][2], which would access the second symptom from the first evaluated string (“fever”). See the R basics page for more detail on accessing elements.
pt1_symptoms <- str_split("jaundice, fever, chills", ",")
pt1_symptoms[[1]][2]  # extracts 2nd value from 1st (and only) element of the list
## [1] " fever"
If multiple strings are evaluated, there will be more than one element in the returned list.
symptoms <- c("jaundice, fever, chills", # patient 1
"chills, aches, pains", # patient 2
"fever", # patient 3
"vomiting, diarrhoea", # patient 4
"bleeding from gums, fever", # patient 5
"rapid pulse, headache") # patient 6
str_split(symptoms, ",")                     # split each patient's symptoms
## [[1]]
## [1] "jaundice" " fever" " chills"
##
## [[2]]
## [1] "chills" " aches" " pains"
##
## [[3]]
## [1] "fever"
##
## [[4]]
## [1] "vomiting" " diarrhoea"
##
## [[5]]
## [1] "bleeding from gums" " fever"
##
## [[6]]
## [1] "rapid pulse" " headache"
To return a “character matrix” instead, which may be useful if creating dataframe columns, set the argument simplify = TRUE as shown below:
str_split(symptoms, ",", simplify = T)
##      [,1]                 [,2]         [,3]
## [1,] "jaundice"           " fever"     " chills"
## [2,] "chills"             " aches"     " pains"
## [3,] "fever"              ""           ""
## [4,] "vomiting"           " diarrhoea" ""
## [5,] "bleeding from gums" " fever"     ""
## [6,] "rapid pulse"        " headache"  ""
You can also adjust the number of pieces to create with the n = argument. For example, n = 2 restricts the output to 2 pieces, splitting from the left side. Any further commas remain within the second piece.
str_split(symptoms, ",", simplify = T, n = 2)
##      [,1]                 [,2]
## [1,] "jaundice"           " fever, chills"
## [2,] "chills"             " aches, pains"
## [3,] "fever"              ""
## [4,] "vomiting"           " diarrhoea"
## [5,] "bleeding from gums" " fever"
## [6,] "rapid pulse"        " headache"
Note - the same outputs can be achieved with str_split_fixed(), in which you do not give the simplify argument, but must instead designate the number of columns (n).
str_split_fixed(symptoms, ",", n = 2)
Within a dataframe, to split one character column into other columns, use separate() from tidyr.
Say we have a simple dataframe df consisting of a case ID column, one character column with symptoms, and one outcome column.
First, provide the column to be separated. Then provide into = as a vector c( ) containing the new column names, as shown below.
- sep = the separator; can be a character, or a number (interpreted as the character position to split at).
- remove = if TRUE (the default), removes the input column.
- convert = if TRUE, converts the new columns to appropriate types, so that string “NA”s become NA (FALSE by default).
- extra = controls what happens if there are more values created by the separation than new columns named.
  - extra = "warn" means you will see a warning but it will drop excess values (the default)
  - extra = "drop" means the excess values will be dropped with no warning
  - extra = "merge" will only split to the number of new columns listed in into - this setting will preserve all your data
An example with extra = "merge" - no data is lost and third symptoms are combined into the second new named column:
# third symptoms combined into second new column
df %>%
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",", extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
## case_ID sym_1 sym_2 outcome
## 1 1 jaundice fever, chills Success
## 2 2 chills aches, pains Failure
## 3 3 fever <NA> Failure
## 4 4 vomiting diarrhoea Success
## 5 5 bleeding from gums fever Success
## 6 6 rapid pulse headache Success
With the default extra = "warn", a warning is given but the third symptoms are lost:
# third symptoms are lost
df %>%
  separate(symptoms, into = c("sym_1", "sym_2"), sep=",")
## Warning: Expected 2 pieces. Additional pieces discarded in 2 rows [1, 2].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
## case_ID sym_1 sym_2 outcome
## 1 1 jaundice fever Success
## 2 2 chills aches Failure
## 3 3 fever <NA> Failure
## 4 4 vomiting diarrhoea Success
## 5 5 bleeding from gums fever Success
## 6 6 rapid pulse headache Success
CAUTION: If you do not provide enough into values for the new columns, your data may be truncated.
One solution to automatically make as many columns as needed could be:
TO DO
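One possible approach, offered as a sketch rather than the authors' intended solution, is to count the separators in each row first and then generate that many column names. The dataframe df below is reconstructed from the symptoms vector and outcome values shown in this section. The name max_pieces and the "sym_" prefix are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Stand-in for df, rebuilt from the values shown in this section
df <- tibble(
  case_ID  = 1:6,
  symptoms = c("jaundice, fever, chills",
               "chills, aches, pains",
               "fever",
               "vomiting, diarrhoea",
               "bleeding from gums, fever",
               "rapid pulse, headache"),
  outcome  = c("Success", "Failure", "Failure", "Success", "Success", "Success")
)

# The largest number of symptoms in any row is the number of commas plus one
max_pieces <- max(str_count(df$symptoms, ",")) + 1

# Create exactly that many columns; rows with fewer symptoms are filled with NA
df_wide <- df %>%
  separate(
    symptoms,
    into = str_c("sym_", seq_len(max_pieces)),
    sep  = ",\\s*",     # split on a comma and any following spaces
    fill = "right"      # fill missing pieces with NA on the right, without a warning
  )
```

Because the number of columns is derived from the data, no values are discarded. The trade-off is that the resulting column count can change whenever the data change.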
Several strings can be sorted by alphabetical order. str_order() returns the order, while str_sort() returns the strings in that order.
# strings
health_zones <- c("Alba", "Takota", "Delta")
# return the alphabetical order
str_order(health_zones)
## [1] 1 3 2
# return the strings in alphabetical order
str_sort(health_zones)
## [1] "Alba"   "Delta"  "Takota"
To use a different alphabet, add the argument locale =. See the full list of locales by entering stringi::stri_locale_list() in the R console.
To arrange strings in order of their value in another column, use arrange() like this:
TO DO
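A minimal sketch of one way to do this with arrange() from dplyr is shown here. The dataframe zone_counts and its case counts are hypothetical values invented for illustration.

```r
library(dplyr)

# Hypothetical health zones with invented case counts
zone_counts <- tibble(
  health_zone = c("Alba", "Takota", "Delta"),
  cases       = c(12, 40, 7)
)

# Order the zone names by descending case count, then extract them as a vector
zones_by_cases <- zone_counts %>%
  arrange(desc(cases)) %>%
  pull(health_zone)
```

Omit desc() to sort in ascending order instead.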
It is common to see the base R functions paste() and paste0(), which concatenate vectors after converting all parts to character. They act similarly to str_c() but the syntax differs - the parts (either text or code/pre-defined objects) are separated by commas, for example: paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks."). The sep and collapse arguments can be adjusted. By default sep is a space, unless using paste0(), where there is no space between parts.
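The comparison can be made concrete with a short sketch. The values of n_beds and n_masks are invented for illustration. One behavioural difference worth knowing is that paste() converts a missing value to the text "NA", whereas str_c() returns NA.

```r
library(stringr)

n_beds  <- 10   # hypothetical value
n_masks <- 50   # hypothetical value

# paste() separates the parts with a space by default
needs_base <- paste("Regional hospital needs", n_beds, "beds and", n_masks, "masks.")

# str_c() has no default separator, so sep must be given to get the same result
needs_stringr <- str_c("Regional hospital needs", n_beds, "beds and", n_masks, "masks.", sep = " ")

# paste0() uses no separator, like the str_c() default
zone_base    <- paste0("Zone", 1)
zone_stringr <- str_c("Zone", 1)

# Missing values are handled differently
na_base    <- paste("Zone", NA)   # the text "Zone NA"
na_stringr <- str_c("Zone", NA)   # a missing value
```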
Use str_pad() to add characters to a string, to a minimum length. By default spaces are added, but you can also pad with other characters using the pad = argument.
# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")
# ICD codes padded to 7 characters on the right side
str_pad(ICD_codes, 7, "right")
## [1] "R10.13 " "R10.819" "R17    "
# Pad with periods instead of spaces
str_pad(ICD_codes, 7, "right", pad = ".")
## [1] "R10.13." "R10.819" "R17...."
For example, to pad numbers with leading zeros (such as for hours or minutes), you can pad the number to minimum length of 2 with pad = "0".
# Add leading zeros to two digits (e.g. for times minutes/hours)
str_pad("4", 2, pad = "0")
## [1] "04"
# example using a numeric column named "hours"
# hours <- str_pad(hours, 2, pad = "0")
str_trunc() sets a maximum length for each string. If a string exceeds this length, it is truncated (shortened) and an ellipsis (…) is included to indicate that the string was previously longer. Note that the ellipsis is counted in the length. The ellipsis characters can be changed with the argument ellipsis =. The optional side = argument specifies where the ellipsis will appear within the truncated string (“left”, “right”, or “center”).
original <- "Symptom onset on 4/3/2020 with vomiting"
str_trunc(original, 10, "center")
## [1] "Symp...ing"
Use str_trunc() to set a maximum length, and then use str_pad() to expand the very short strings to that truncated length. In the example below, 6 is set as the maximum length (one value is truncated), and then a very short value is padded to achieve length of 6.
# ICD codes of differing length
ICD_codes <- c("R10.13",
               "R10.819",
               "R17")
# truncate to maximum length of 6
ICD_codes_2 <- str_trunc(ICD_codes, 6)
ICD_codes_2
## [1] "R10.13" "R10..." "R17"
# expand to minimum length of 6
ICD_codes_3 <- str_pad(ICD_codes_2, 6, "right")
ICD_codes_3
## [1] "R10.13" "R10..." "R17   "
Use str_trim() to remove spaces, newlines (\n) or tabs (\t) on the sides of a string input. Add "right", "left", or "both" to the command to specify which side to trim (e.g. str_trim(x, "right")).
# ID numbers with excess spaces on right
IDs <- c("provA_1852  ", # two excess spaces
         "provA_2345",   # zero excess spaces
         "provA_9460 ")  # one excess space
# IDs trimmed to remove excess spaces (by default, from both sides)
str_trim(IDs)
## [1] "provA_1852" "provA_2345" "provA_9460"
Use str_squish() to remove repeated spaces that appear inside a string. For example, to convert double spaces into single spaces. It also removes spaces, newlines, or tabs on the outside of the string like str_trim().
# original contains excess spaces within string
str_squish("  Pt requires   IV saline\n")
## [1] "Pt requires IV saline"
Enter ?str_trim, ?str_pad in your R console to see further details.
Use str_wrap() to wrap a long unstructured text into a structured paragraph with fixed line length. Provide the ideal character length for each line, and it applies an algorithm to insert newlines (\n) within the paragraph, as seen in the example below.
pt_course <- "Symptom onset 1/4/2020 vomiting chills fever. Pt saw traditional healer in home village on 2/4/2020. On 5/4/2020 pt symptoms worsened and was admitted to Lumta clinic. Sample was taken and pt was transported to regional hospital on 6/4/2020. Pt died at regional hospital on 7/4/2020."
str_wrap(pt_course, 40)
## [1] "Symptom onset 1/4/2020 vomiting chills\nfever. Pt saw traditional healer in\nhome village on 2/4/2020. On 5/4/2020\npt symptoms worsened and was admitted\nto Lumta clinic. Sample was taken and pt\nwas transported to regional hospital on\n6/4/2020. Pt died at regional hospital\non 7/4/2020."
The base function cat() can be wrapped around the above command in order to print the output, displaying the new lines added.
cat(str_wrap(pt_course, 40))
## Symptom onset 1/4/2020 vomiting chills
## fever. Pt saw traditional healer in
## home village on 2/4/2020. On 5/4/2020
## pt symptoms worsened and was admitted
## to Lumta clinic. Sample was taken and pt
## was transported to regional hospital on
## 6/4/2020. Pt died at regional hospital
## on 7/4/2020.
Often one must alter the case/capitalization of a string value, for example names of jurisdictions. Use str_to_upper(), str_to_lower(), and str_to_title(), as shown below:
str_to_upper("California")
## [1] "CALIFORNIA"
str_to_lower("California")
## [1] "california"
Using base R, the above can also be achieved with toupper() and tolower().
Title case
Transforming the string so each word is capitalized can be achieved with str_to_title():
str_to_title("go to the US state of california ")
## [1] "Go To The Us State Of California "
Use toTitleCase() from the tools package to achieve more nuanced capitalization (words like “to”, “the”, and “of” are not capitalized).
tools::toTitleCase("This is the US state of california")
## [1] "This is the US State of California"
You can also use str_to_sentence(), which capitalizes only the first letter of the string.
str_to_sentence("the patient must be transported")
## [1] "The patient must be transported"
Many stringr functions work to detect, locate, extract, match, replace, and split based on a specified pattern.
Use str_detect() as below to detect presence/absence of a pattern within a string. First list the string or vector to search in, and then the pattern to look for. Note that by default the search is case sensitive!
str_detect("primary school teacher", "teach")
## [1] TRUE
The argument negate = can be included and set to TRUE if you want to know if the pattern is NOT present.
str_detect("primary school teacher", "teach", negate = TRUE)
## [1] FALSE
To ignore case/capitalization, wrap the pattern within regex() and within regex() add the argument ignore_case = T.
str_detect("Teacher", regex("teach", ignore_case = T))
## [1] TRUE
When str_detect() is applied to a character vector/column, it will return a TRUE/FALSE for each of the values in the vector.
# a vector/column of occupations
occupations <- c("field laborer",
"university professor",
"primary school teacher & tutor",
"tutor",
"nurse at regional hospital",
"lineworker at Amberdeen Fish Factory",
"physican",
"cardiologist",
"office worker",
"food service")
# Detect presence of pattern "teach" in each string - output is vector of TRUE/FALSE
str_detect(occupations, "teach")
## [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
If you need to count these, apply sum() to the output. This counts the number of TRUE values.
sum(str_detect(occupations, "teach"))
## [1] 1
To search inclusive of multiple terms, include them separated by OR bars (|) within the pattern, as shown below:
sum(str_detect(occupations, "teach|professor|tutor"))
## [1] 3
If you need to make a long list of search terms, you can combine them using str_c() with sep = "|", define this as a character object, and reference it later more succinctly. The example below includes possible occupation search terms for frontline medical providers.
# search terms
occupation_med_frontline <- str_c("medical", "medicine", "hcw", "healthcare", "home care", "home health",
"surgeon", "doctor", "doc", "physician", "surgery", "peds", "pediatrician",
"intensivist", "cardiologist", "coroner", "nurse", "nursing", "rn", "lpn",
"cna", "pa", "physician assistant", "mental health",
"emergency department technician", "resp therapist", "respiratory",
"phlebotomist", "pharmacy", "pharmacist", "hospital", "snf", "rehabilitation",
"rehab", "activity", "elderly", "subacute", "sub acute",
"clinic", "post acute", "therapist", "extended care",
"dental", "dential", "dentist", sep = "|")
occupation_med_frontline
## [1] "medical|medicine|hcw|healthcare|home care|home health|surgeon|doctor|doc|physician|surgery|peds|pediatrician|intensivist|cardiologist|coroner|nurse|nursing|rn|lpn|cna|pa|physician assistant|mental health|emergency department technician|resp therapist|respiratory|phlebotomist|pharmacy|pharmacist|hospital|snf|rehabilitation|rehab|activity|elderly|subacute|sub acute|clinic|post acute|therapist|extended care|dental|dential|dentist"
This command returns the number of occupations which contain any one of the search terms for front-line medical providers (occupation_med_frontline):
sum(str_detect(occupations, occupation_med_frontline))
## [1] 2
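When the search terms are already stored in a single character vector (for example, read in from a spreadsheet), sep = will not join them, because sep only joins separate arguments. A minimal sketch, assuming a hypothetical vector named educator_terms, uses the collapse = argument of str_c() instead:

```r
library(stringr)

# hypothetical vector of search terms (illustrative only)
educator_terms <- c("teach", "prof", "tutor")

# collapse = joins the elements of ONE vector into a single string
educator_pattern <- str_c(educator_terms, collapse = "|")
educator_pattern

# use the combined pattern in a search
jobs <- c("university professor", "tutor", "office worker")
n_educators <- sum(str_detect(jobs, educator_pattern))
n_educators
```

Keeping the terms in a vector can make a long list easier to edit, since each term stays a separate element until the pattern is built.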
Base R string search functions
The base function grepl() works similarly to str_detect(), in that it searches for matches to a pattern and returns a logical vector. The basic syntax is grepl(pattern, strings_to_search, ignore.case = FALSE, ...). One advantage is that the ignore.case argument is easier to write (there is no need to involve regex() function).
Likewise, the base functions sub() and gsub() act similarly to str_replace(). Their basic syntax is: gsub(pattern, replacement, strings_to_search, ignore.case = FALSE). sub() will replace the first instance of the pattern, whereas gsub() will replace all instances of the pattern.
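As a small illustration of these base R functions, here is a sketch using a made-up vector (the object names are hypothetical, not part of the handbook data):

```r
# small illustrative vector (not from the handbook data)
jobs <- c("Primary school teacher", "nurse", "Teaching assistant")

# grepl(): returns a logical vector; ignore.case = TRUE makes it case-insensitive
is_teaching <- grepl("teach", jobs, ignore.case = TRUE)
is_teaching

# sub() replaces only the first match in each string
first_only <- sub("e", "E", "teacher")
first_only

# gsub() replaces every match in each string
all_matches <- gsub("e", "E", "teacher")
all_matches
```

Note the argument order: in the base functions the pattern comes first and the strings to search come later, which is the reverse of the stringr functions.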
Use str_replace_all() as a “find and replace” tool. First, provide the strings to be evaluated, then the pattern to be replaced, and then the replacement value. The example below replaces all instances of “dead” with “deceased”. Note, this IS case sensitive.
outcome <- c("Karl: dead",
"Samantha: dead",
"Marco: not dead")
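str_replace_all(outcome, "dead", "deceased")
## [1] "Karl: deceased" "Samantha: deceased" "Marco: not deceased"
The function str_replace() replaces only the first instance of the pattern within each evaluated string. To replace the complete string with NA when the pattern is found, supply replacement = NA_character_ to str_replace(). Note that str_replace_na() does something different: it converts missing values (NA) into the character string "NA".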
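The difference between these replacement functions can be seen in a short sketch. The vector status is invented for illustration:

```r
library(stringr)

# illustrative vector with a repeated term and a missing value
status <- c("dead, confirmed dead", "alive", NA)

# str_replace(): only the FIRST match in each string is replaced
first_replaced <- str_replace(status, "dead", "deceased")
first_replaced

# str_replace_all(): EVERY match in each string is replaced
all_replaced <- str_replace_all(status, "dead", "deceased")
all_replaced

# str_replace_na(): missing values become the text "NA"
na_as_text <- str_replace_na(status)
na_as_text
```

In the first two calls the missing value stays missing; only str_replace_na() changes it.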
Within case_when()
str_detect() is often used within case_when() (from dplyr). Let’s say the occupations are a column in the linelist called occupations. The mutate() below creates a new column called is_educator by using conditional logic via case_when(). See the page on data cleaning to learn more about case_when().
df <- df %>%
mutate(is_educator = case_when(
# term search within occupation, not case sensitive
str_detect(occupations,
regex("teach|prof|tutor|university",
ignore_case = TRUE)) ~ "Educator",
# all others
TRUE ~ "Not an educator"))

As a reminder, it may be important to add exclusion criteria to the conditional logic (using negate = TRUE):
df <- df %>%
# value in new column is_educator is based on conditional logic
mutate(is_educator = case_when(
# occupation column must meet 2 criteria to be assigned "Educator":
# it must have a search term AND NOT any exclusion term
# Must have a search term AND
str_detect(occupations,
regex("teach|prof|tutor|university", ignore_case = T)) &
# Must NOT have an exclusion term
str_detect(occupations,
regex("admin", ignore_case = T),
negate = T) ~ "Educator",
# All rows not meeting above criteria
TRUE ~ "Not an educator"))

To locate the first position of a pattern, use str_locate(). It outputs a start and end position.
str_locate("I wish", "sh")
## start end
## [1,] 5 6
Like other str functions, there is an "_all" version (str_locate_all()) which will return the positions of all instances of the pattern within each string. This outputs as a list.
phrases <- c("I wish", "I hope", "he hopes", "He hopes")
str_locate(phrases, "h") # position of *first* instance of the pattern
## start end
## [1,] 6 6
## [2,] 3 3
## [3,] 1 1
## [4,] 4 4
str_locate_all(phrases, "h") # position of *every* instance of the pattern
## [[1]]
## start end
## [1,] 6 6
##
## [[2]]
## start end
## [1,] 3 3
##
## [[3]]
## start end
## [1,] 1 1
## [2,] 4 4
##
## [[4]]
## start end
## [1,] 4 4
str_extract_all() returns the matching patterns themselves, which is most useful when you have offered several patterns via “OR” conditions. For example, you can search the string vector of occupations (defined above) for either “teach”, “prof”, or “tutor”.
str_extract_all() returns a list which contains all matches for each evaluated string. See below how occupation 3 has two pattern matches within it.
str_extract_all(occupations, "teach|prof|tutor")
## [[1]]
## character(0)
##
## [[2]]
## [1] "prof"
##
## [[3]]
## [1] "teach" "tutor"
##
## [[4]]
## [1] "tutor"
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
##
## [[7]]
## character(0)
##
## [[8]]
## character(0)
##
## [[9]]
## character(0)
##
## [[10]]
## character(0)
str_extract() extracts only the first match in each evaluated string, producing a character vector with one element for each evaluated string. It returns NA where there was no match. The NAs can be removed by wrapping the returned vector with na.exclude(). Note how the second of occupation 3’s matches is not shown.
str_extract(occupations, "teach|prof|tutor")
## [1] NA "prof" "teach" "tutor" NA NA NA NA NA NA
Subset, Count
Aligned functions include str_subset() and str_count().
str_subset() returns the actual values which contained the pattern:
str_subset(occupations, "teach|prof|tutor")
## [1] "university professor" "primary school teacher & tutor" "tutor"
str_count() returns a vector of numbers: the number of times a search term appears in each evaluated value.
str_count(occupations, regex("teach|prof|tutor", ignore_case = TRUE))
## [1] 0 1 2 1 0 0 0 0 0 0
Groups within strings
str_match() TBD
Regular expressions, or “regex”, is a concise language for describing patterns in strings.
Much of this tab is adapted from this tutorial and this cheatsheet
Backslash \ as escape
The backslash \ is used to “escape” the meaning of the next character. This way, a backslash can be used to have a quote mark display within other quote marks (\") - the middle quote mark will not “break” the surrounding quote marks.
Note: if you want to display a backslash, you must escape its meaning with another backslash. So you must write two backslashes \\ to display one.
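A short base R sketch of escaping is shown here. The string quoted is invented for illustration:

```r
# a string containing double quotes, each escaped with a backslash
quoted <- "The patient said \"I feel dizzy\""

print(quoted)       # print() shows the escape characters
writeLines(quoted)  # writeLines() shows the text as it will be displayed

# two typed backslashes are stored as ONE backslash character
single_backslash <- "\\"
nchar(single_backslash)
```

Comparing print() with writeLines() is a quick way to check what a string actually contains once the escapes have been interpreted.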
Special characters
| Special character | Represents |
|---|---|
| "\\" | backslash |
| "\n" | a new line (newline) |
| "\"" | double-quote within double quotes |
| '\'' | single-quote within single quotes |
| "\`" | grave accent |
| "\r" | carriage return |
| "\t" | tab |
| "\v" | vertical tab |
| "\b" | backspace |
Run ?"'" in the R Console to display a complete list of these special characters (it will appear in the RStudio Help pane).
If you are not familiar with it, a regular expression can look like an alien language.
A regular expression is applied to extract specific patterns from unstructured text, for example medical notes, chief complaints, patient history, or other free text columns in a dataset.
There are four basic tools one can use to create a basic regular expression:
Character sets
Character sets are a way of listing the options for a character match, within brackets. A match is triggered if any one of the characters within the brackets is found in the string. For example, to look for vowels one could use this character set: "[aeiou]". Some other common character sets are:
| Character set | Matches for |
|---|---|
| "[A-Z]" | any single capital letter |
| "[a-z]" | any single lowercase letter |
| "[0-9]" | any digit |
| [:alnum:] | any alphanumeric character |
| [:digit:] | any numeric digit |
| [:alpha:] | any letter (upper or lowercase) |
| [:upper:] | any uppercase letter |
| [:lower:] | any lowercase letter |
Character sets can be combined within one bracket (no spaces!), such as "[A-Za-z]" (any upper or lowercase letter), or another example "[t-z0-5]" (lowercase t through z OR number 0 through 5).
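To see character sets in action, here is a minimal sketch on an invented string (bed_code is a hypothetical example value):

```r
library(stringr)

# invented example string
bed_code <- "Ward C3, bed 12B"

# any single capital letter
capitals <- str_extract_all(bed_code, "[A-Z]")[[1]]
capitals

# any single digit
digits <- str_extract_all(bed_code, "[0-9]")[[1]]
digits

# combined set with the + quantifier: runs of letters and/or digits
letters_digits <- str_extract_all(bed_code, "[A-Za-z0-9]+")[[1]]
letters_digits
```

The [[1]] at the end extracts the character vector for the first (and only) string from the list that str_extract_all() returns.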
Meta characters
Meta characters are shorthand for character sets. Some of the important ones are listed below:
| Meta character | Represents |
|---|---|
| "\\s" | a single whitespace character (space, tab, or newline) |
| "\\w" | any single "word" character (A-Z, a-z, 0-9, or underscore) |
| "\\d" | any single numeric digit (0-9) |
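The meta characters can be tried out on a short string. This is only a sketch; the strings below are invented, and the first one contains a tab (\t) to show that "\\s" matches more than ordinary spaces:

```r
library(stringr)

# invented string containing two spaces and one tab
vitals <- "Temp 38.5\tat 07:30"

# count whitespace characters (the two spaces and the tab)
n_whitespace <- str_count(vitals, "\\s")
n_whitespace

# runs of digits
digit_runs <- str_extract_all(vitals, "\\d+")[[1]]
digit_runs

# runs of "word" characters: note that the underscore is included
word_runs <- str_extract_all("case_id 42", "\\w+")[[1]]
word_runs
```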
Quantifiers
Typically you do not want to search for a match on only one character. Quantifiers allow you to designate the length of letters/numbers to allow for the match.
Quantifiers are numbers written within curly brackets { } after the character they are quantifying. For example:
"A{2}" will return instances of two capital A letters.
"A{2,4}" will return instances of between two and four capital A letters (do not put spaces!).
"A{2,}" will return instances of two or more capital A letters.
"A+" will return instances of one or more capital A letters (the group is extended until a different character is encountered).
"A*" will return instances of zero or more capital A letters (useful if you are not sure the pattern is present).
Using the + plus symbol as a quantifier, the match will continue until a different character is encountered. For example, this expression will return all words (alpha characters): "[A-Za-z]+"
# test string for quantifiers
test <- "A-AA-AAA-AAAA"

When a quantifier of {2} is used, only pairs of consecutive A’s are returned. Two pairs are identified within AAAA.
str_extract_all(test, "A{2}")
## [[1]]
## [1] "AA" "AA" "AA" "AA"
When a quantifier of {2,4} is used, groups of consecutive A’s that are two to four in length are returned.
str_extract_all(test, "A{2,4}")
## [[1]]
## [1] "AA" "AAA" "AAAA"
With the quantifier +, groups of one or more are returned:
str_extract_all(test, "A+")
## [[1]]
## [1] "A" "AA" "AAA" "AAAA"
Relative position
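These express requirements for what precedes or follows a pattern, without including those neighboring characters in the match. For example, "(?<=\\.)\\s(?=[A-Z])" matches whitespace that is preceded by a period and followed by a capital letter, which marks the break between two sentences. Another example would be “two numbers that are followed by a period”.
str_extract_all(test, "")
## [[1]]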
## [1] "A" "-" "A" "A" "-" "A" "A" "A" "-" "A" "A" "A" "A"
| Position statement | Matches to |
|---|---|
| "(?<=b)a" | “a” that is preceded by a “b” |
| "(?<!b)a" | “a” that is NOT preceded by a “b” |
| "a(?=b)" | “a” that is followed by a “b” |
| "a(?!b)" | “a” that is NOT followed by a “b” |
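The position statements can be sketched on a short invented note (the object names here are hypothetical). The first pattern extracts a number only when it is followed by " bpm". The second splits the text at whitespace that sits between a period and a capital letter:

```r
library(stringr)

# invented free-text note
note <- "Temperature was 99.8 degrees. Pulse rate was 100 bpm. Respiratory rate was 29."

# digits that are followed by a space and "bpm" (the " bpm" itself is not returned)
bpm_value <- str_extract(note, "\\d+(?=\\sbpm)")
bpm_value

# split at whitespace preceded by a period and followed by a capital letter
sentences <- str_split(note, "(?<=\\.)\\s(?=[A-Z])")[[1]]
sentences
```

Because the decimal point in 99.8 is followed by a digit rather than by whitespace, it does not trigger a split.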
Groups
Capturing groups in your regular expression is a way to have a more organized output upon extraction.
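As one possible illustration (a sketch, not necessarily the approach the authors intend for this section), parentheses define groups, and str_match() from stringr returns the full match plus each group in separate columns of a matrix. The string arrival is invented:

```r
library(stringr)

# invented example string
arrival <- "Patient arrived at 18:00 on 6/12/2005."

# two groups: the hour (1-2 digits) and the minutes (2 digits)
time_parts <- str_match(arrival, "(\\d{1,2}):(\\d{2})")
time_parts

hour_part <- time_parts[, 2]    # first group
minute_part <- time_parts[, 3]  # second group
```

The first column holds the complete match ("18:00"), and each following column holds one group, in the order the parentheses open.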
Regex examples
Below is a free text for the examples. We will try to extract useful information from it using a regular expression search term.
pt_note <- "Patient arrived at Broward Hospital emergency ward at 18:00 on 6/12/2005. Patient presented with radiating abdominal pain from LR quadrant. Patient skin was pale, cool, and clammy. Patient temperature was 99.8 degrees farinheit. Patient pulse rate was 100 bpm and thready. Respiratory rate was 29 per minute."

This expression matches all words (any run of letters, ending at a non-letter such as a space, digit, or punctuation mark):
str_extract_all(pt_note, "[A-Za-z]+")
## [[1]]
## [1] "Patient" "arrived" "at" "Broward" "Hospital" "emergency" "ward" "at" "on" "Patient"
## [11] "presented" "with" "radiating" "abdominal" "pain" "from" "LR" "quadrant" "Patient" "skin"
## [21] "was" "pale" "cool" "and" "clammy" "Patient" "temperature" "was" "degrees" "farinheit"
## [31] "Patient" "pulse" "rate" "was" "bpm" "and" "thready" "Respiratory" "rate" "was"
## [41] "per" "minute"
The expression "[0-9]{1,2}" matches to consecutive numbers that are 1 or 2 digits in length. It could also be written "\\d{1,2}", or "[:digit:]{1,2}".
str_extract_all(pt_note, "[0-9]{1,2}")
## [[1]]
## [1] "18" "00" "6" "12" "20" "05" "99" "8" "10" "0" "29"
Be careful when splitting on a period. In a regular expression an unescaped "." matches any character, so str_split(pt_note, ".") splits at every character and returns only empty strings. To split on literal periods, escape the period (or wrap it in fixed()). Note that this also splits at the decimal point in 99.8.
str_split(pt_note, "\\.")
The expression below reads in English as: a capital letter followed by one or more lowercase letters, a space, a word, a space, one or two digits, a space, a word, and then optionally another space and word. No part of pt_note fits this exact sequence, so nothing is returned (character(0)).
str_extract_all(pt_note, "[A-Z][a-z]+\\s\\w+\\s\\d{1,2}\\s\\w+\\s*\\w*")
## [[1]]
## character(0)
You can view a useful list of regex expressions and tips on page 2 of this cheatsheet
Also see this tutorial.
This page covers the following subjects:
Load packages
pacman::p_load(tidyverse, # deduplication, grouping, and slicing functions
janitor, # function for reviewing duplicates
stringr # for string searches, can be used in "rolling-up" values
)

Example dataset
For demonstration, we will use the fake dataset below. It is a record of COVID-19 phone encounters, including with contacts and with cases.
This tab uses the dataset from the Preparation tab to describe how to review and remove duplicate rows in a dataframe. It also shows how to handle duplicate elements in a vector.
To quickly review rows that have duplicates, you can use get_dupes() from the janitor package. By default, all columns are considered when duplicates are evaluated - rows returned are 100% duplicates considering the values in all columns.
In the obs dataframe, the first two rows are 100% duplicates: they have the same value in every column (including the recordID column, which is supposed to be unique, so it must be some computer glitch). The returned dataframe automatically includes a new column dupe_count, showing the number of rows with that combination of duplicate values.
# 100% duplicates across all columns
obs %>%
janitor::get_dupes()

However, if we choose to ignore recordID, the 3rd and 4th rows are also duplicates. That is, they have the same values in all columns except for recordID. You can specify columns to be ignored in the function using a - minus symbol.
# Duplicates when column recordID is not considered
obs %>%
janitor::get_dupes(-recordID) # if multiple columns, wrap them in c()

You can also positively specify the columns to consider. Below, only rows that have the same values in the name and purpose columns are returned. Notice how “amrish” now has dupe_count equal to 3 to reflect his three “contact” encounters.
# duplicates based on name and purpose columns ONLY
obs %>%
janitor::get_dupes(name, purpose)

See ?get_dupes for more details, or see this online reference
To keep only unique rows of a dataframe, use distinct() from dplyr. Rows that are duplicates are removed such that only the first of such rows is kept. By default, “first” means the row that appears earliest in the dataframe (order of rows top-to-bottom). In the example below, one duplicate row (the first row, for “adam”) has been removed (n is now 18, not 19 rows).
# added to a chain of pipes (e.g. data cleaning)
obs %>%
distinct(across(-recordID), # reduces dataframe to only unique rows (keeps first one of any duplicates)
.keep_all = TRUE)
# if outside pipes, include the data as first argument
# distinct(obs)

CAUTION: If using distinct() on grouped data, the function will apply to each group.
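To make the behavior of distinct() concrete, here is a minimal sketch on a tiny made-up dataframe (calls is hypothetical and much smaller than the obs dataset). It contrasts whole-row de-duplication with de-duplication that ignores the ID column:

```r
library(dplyr)

# tiny made-up dataframe: rows 1 and 2 are 100% duplicates,
# rows 3 and 4 differ only in recordID
calls <- tibble(
  recordID = c(1, 1, 2, 3),
  name     = c("adam", "adam", "amrish", "amrish"),
  purpose  = c("contact", "contact", "contact", "contact"))

# all columns considered: only the exact duplicate is removed
whole_row <- distinct(calls)
whole_row

# only name and purpose considered; .keep_all = TRUE retains recordID
ignoring_id <- distinct(calls, name, purpose, .keep_all = TRUE)
ignoring_id
```

In the second result, the recordID shown for each person is the one from the first (topmost) of their rows.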
Deduplicate based on specific columns
You can also specify columns to be the basis for de-duplication. In this way, the de-duplication only applies to rows that are duplicates within the specified columns. Unless specified with .keep_all = TRUE, all columns not mentioned will be dropped.
In the example below, the de-duplication only applies to rows that have identical values for name and purpose columns. Thus, “brian” has only 2 rows instead of 3 - his first “contact” encounter and his only “case” encounter. To adjust so that brian’s latest encounter of each purpose is kept, see the tab on Slicing within groups.
# added to a chain of pipes (e.g. data cleaning)
obs %>%
distinct(name, purpose, .keep_all = TRUE) %>% # keep rows unique by name and purpose, retain all columns
arrange(name) # arrange for easier viewing

The function duplicated() from base R will evaluate a vector (column) and return a logical vector of the same length (TRUE/FALSE). The first time a value appears, it will return FALSE (not a duplicate), and subsequent times that value appears it will return TRUE. Note how NA is treated the same as any other value.
x <- c(1, 1, 2, NA, NA, 4, 5, 4, 4, 1, 2)
duplicated(x)
## [1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
To return only the duplicated elements, you can use brackets to subset the original vector:
x[duplicated(x)]
## [1] 1 NA 4 4 1 2
To return only the unique elements, use unique() from base R. To remove NAs from the output, nest na.omit() within unique().
unique(x) # alternatively, use x[!duplicated(x)]
## [1] 1 2 NA 4 5
unique(na.omit(x)) # remove NAs
## [1] 1 2 4 5
To return duplicate rows
In base R, you can also see which rows are 100% duplicates in a dataframe df with the command duplicated(df) (returns a logical vector of the rows).
Thus, you can also use the base subset [ ] on the dataframe to see the duplicated rows with df[duplicated(df),] (don’t forget the comma, meaning that you want to see all columns!).
To return unique rows
See the notes above. To see the unique rows you add the logical negator ! in front of the duplicated() function:
df[!duplicated(df),]
To return rows that are duplicates of only certain columns
Subset the df that is within the duplicated() parentheses, so this function will operate on only certain columns of the df.
To specify the columns, provide column numbers or names after a comma (remember, all this is within the duplicated() function).
Be sure to keep the comma , outside after the duplicated() function as well!
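For example, to evaluate only columns 2 through 5 for duplicates: df[!duplicated(df[, 2:5]),]
To evaluate only columns name and purpose for duplicates: df[!duplicated(df[, c("name", "purpose")]),]
As written with the ! negator, these commands return the rows that remain after de-duplication. Remove the ! to see the duplicate rows themselves.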
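The base R approach can be tried on a small made-up dataframe. This is only a sketch; df_demo and its values are hypothetical:

```r
# small made-up dataframe
df_demo <- data.frame(
  recordID = 1:4,
  name     = c("adam", "adam", "amrish", "amrish"),
  purpose  = c("contact", "contact", "contact", "case"))

# TRUE for rows whose name + purpose combination has already appeared
dup_flag <- duplicated(df_demo[, c("name", "purpose")])
dup_flag

# rows that ARE duplicates (by name and purpose)
dup_rows <- df_demo[dup_flag, ]
dup_rows

# rows kept after de-duplication
unique_rows <- df_demo[!dup_flag, ]
unique_rows
```

Saving the logical vector first makes it easy to inspect which rows were flagged before subsetting.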
To “slice” a dataframe is useful in de-duplication if you have multiple rows per functional group (e.g. per “person”) and you only want to analyze one or some of them. Think of slicing as a filter on the rows, by row number/position.
The basic slice() function accepts a number n. If positive, only the nth row is returned. If negative, all rows except the nth are returned.
Variations include:
slice_min() and slice_max() - to keep only the row(s) with the minimum or maximum value of the specified column. These also work with ordered factors.
slice_head() and slice_tail() - to keep only the first or last row(s)
slice_sample() - to keep only a random sample of the rows
Use arguments n = or prop = to specify the number or proportion of rows to keep. If not using the function in a pipe chain, provide the data argument first (e.g. slice_head(df, n = 2)). See ?slice for more information.
Other arguments:
order_by = - used in slice_min() and slice_max(); this is the column to order by before slicing.
with_ties = - TRUE by default, meaning ties are kept.
.preserve = - FALSE by default, meaning the grouping structure is re-calculated after slicing. If TRUE, the existing grouping structure is kept as is.
weight_by = - Optional, numeric column to weight by (bigger number more likely to get sampled). Also replace = for whether sampling is done with/without replacement.
TIP: When using slice_max() and slice_min(), be sure to specify/write the n = (e.g. n = 2, not just 2). Otherwise you may get an error Error:…is not empty.
NOTE: You may encounter the function top_n(), which has been superseded by the slice functions.
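The slice variations can be compared side by side in a short sketch. The dataframe visits is invented for illustration and is not the obs dataset:

```r
library(dplyr)

# invented encounters; three rows share the latest date
visits <- tibble(
  name      = c("adam", "adam", "mariah", "mariah", "mariah"),
  encounter = c(1, 2, 1, 2, 3),
  date      = as.Date(c("2020-01-05", "2020-01-06", "2020-01-03",
                        "2020-01-06", "2020-01-06")))

second_row <- visits %>% slice(2)            # the 2nd row only
first_two  <- visits %>% slice_head(n = 2)   # the first two rows

# rows with the latest date; ties are kept by default
latest_with_ties <- visits %>% slice_max(date, n = 1)

# ties removed: exactly one row is returned
latest_no_ties <- visits %>% slice_max(date, n = 1, with_ties = FALSE)
```

Note that n = is written out in every slice_max() call, as the tip above recommends.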
Here, the basic slice() function is used to keep only the 4th row:
obs %>%
slice(4) # keeps the 4th row only

The slice_*() functions can be very useful if applied to a grouped dataframe, as the slice operation is performed on each group separately. Use the function group_by() in conjunction with slice() to group the data and then take a slice from each group.
This is helpful for de-duplication if you have multiple rows per person but only want to keep one of them. You first use group_by() with key columns that are the same, and then use a slice function on a column that will differ among the grouped rows.
In the example below, to keep only the latest encounter per person, we group the rows by name and then use slice_max() with n = 1 on the date column. Be aware! To apply a function like slice_max() on dates, the date column must be class Date.
By default, “ties” (e.g. same date in this scenario) are kept, and we would still get multiple rows for some people (e.g. adam). To avoid this we set with_ties = FALSE. We get back only one row per person.
CAUTION: If using arrange(), specify .by_group = TRUE to have the data arranged within each group.
DANGER: If with_ties = FALSE, the first row of a tie is kept. This may be deceptive. See how for Mariah, she has two encounters on her latest date (6 Jan) and the first (earliest) one was kept. Likely, we want to keep her later encounter on that day. See how to “break” these ties in the next example.
obs %>%
group_by(name) %>% # group the rows by 'name'
slice_max(date, # keep row per group with maximum date value
n = 1, # keep only the single highest row
with_ties = F) # if there's a tie (of date), take the first row

Breaking “ties”
Multiple slice statements can be run to “break ties”. In this case, if a person has multiple encounters on their latest date, the encounter with the latest time is kept (lubridate::hm() is used to convert the character times to a sortable time class).
Note how now, the one row kept for “Mariah” on 6 Jan is encounter 3 from 08:32, not encounter 2 at 07:25.
# Example of multiple slice statements to "break ties"
obs %>%
group_by(name) %>%
# FIRST - slice by latest date
slice_max(date, n = 1, with_ties = TRUE) %>%
# SECOND - if there is a tie, select row with latest time; ties prohibited
slice_max(lubridate::hm(time), n = 1, with_ties = FALSE)

In the example above, it would also have been possible to slice by encounter number, but we showed the slice on date and time for example purposes.
TIP: To use slice_max() or slice_min() on a “character” column, mutate it to an ordered factor class!
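If you want to keep all records but mark only some for analysis, consider a two-step approach utilizing a unique recordID/encounter number:
1. Reduce/slice the original dataframe to only the rows for analysis, and save this reduced dataframe.
2. In the original dataframe, mark rows as appropriate with case_when(), based on whether their record unique identifier (recordID in this example) is present in the reduced dataframe.

# 1. Define dataframe of rows to keep for analysis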
obs_keep <- obs %>%
group_by(name) %>%
slice_max(encounter, n = 1, with_ties = FALSE) # keep only latest encounter per person
# 2. Mark original dataframe
obs_marked <- obs %>%
# make new dup_record column
mutate(dup_record = case_when(
# if record is in obs_keep dataframe
recordID %in% obs_keep$recordID ~ "For analysis",
# all else marked as "Ignore" for analysis purposes
TRUE ~ "Ignore"))
# print
obs_marked

Create a column that contains a metric for the row’s completeness (non-missingness). This could be helpful when deciding which rows to prioritize over others when de-duplicating/slicing.
In this example, “key” columns over which you want to measure completeness are saved in a vector of column names.
Then the new column key_completeness is created with mutate(). The new value in each row is defined as a calculated fraction: the number of non-missing values in that row among the key columns, divided by the number of key columns.
This involves the function rowSums() from base R. Also used is ., which within piping refers to the dataframe at that point in the pipe (in this case, it is being subset with brackets []).
# create a "key variable completeness" column
# this is a *proportion* of the columns designated as "key_vars" that have non-missing values
key_cols = c("personID", "name", "symptoms_ever")
obs %>%
mutate(key_completeness = rowSums(!is.na(.[,key_cols]))/length(key_cols))

This tab describes:
This tab uses the example dataset from the Preparation tab.
The code example below uses group_by() and summarise() to group rows by person, and then paste together all values within the grouped rows. Thus, you get one summary row per person. A few notes:
If you want to show only unique values per cell, wrap the na.omit() with unique()
na.omit() removes NA values, but if this is not desired it can be removed, leaving paste0(.x)…
# "Roll-up" values into one row per group (per "personID")
cases_rolled <- obs %>%
# create groups by personID
group_by(personID) %>%
# order the rows within each group (e.g. by date)
arrange(date, .by_group = TRUE) %>%
# For each column, paste together all values within the grouped rows, separated by ";"
summarise(
across(everything(), # apply to all columns
~paste0(na.omit(.x), collapse = "; "))) # function is defined which combines non-NA values

The result is one row per group (ID), with entries arranged by date and pasted together.
This variation shows unique values only:
# Variation - show unique values only
cases_rolled <- obs %>%
group_by(personID) %>%
arrange(date, .by_group = TRUE) %>%
summarise(
across(everything(), # apply to all columns
~paste0(unique(na.omit(.x)), collapse = "; "))) # function is defined which combines unique non-NA values

This variation appends a suffix to each column.
In this case "_roll" to signify that it has been rolled:
# Variation - suffix added to column names
cases_rolled <- obs %>%
group_by(personID) %>%
arrange(date, .by_group = TRUE) %>%
summarise(
across(everything(),
list(roll = ~paste0(na.omit(.x), collapse = "; ")))) # _roll is appended to column names

If you then want to evaluate all of the rolled values, and keep only a specific value (e.g. “best” or “maximum” value), you can use mutate() across the desired columns, to implement case_when(), which uses str_detect() from the stringr package to sequentially look for string patterns and overwrite the cell content.
# CLEAN CASES
#############
cases_clean <- cases_rolled %>%
# clean Yes-No-Unknown vars: replace text with "highest" value present in the string
mutate(across(c(contains("symptoms_ever")), # operates on specified columns (Y/N/U)
list(mod = ~case_when( # adds suffix "_mod" to new cols; implements case_when()
str_detect(.x, "Yes") ~ "Yes", # if "Yes" is detected, then cell value converts to yes
str_detect(.x, "No") ~ "No", # then, if "No" is detected, then cell value converts to no
str_detect(.x, "Unknown") ~ "Unknown", # then, if "Unknown" is detected, then cell value converts to Unknown
TRUE ~ as.character(.x)))), # otherwise, the value is kept as is
.keep = "unused") # old columns removed, leaving only _mod columns

Now you can see in the column symptoms_ever that if the person EVER said “Yes” to symptoms, then only “Yes” is displayed.
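The roll-up and clean steps can be run end-to-end on a tiny made-up dataframe. This is a simplified sketch: obs_demo is hypothetical, only one column is rolled, and the column is overwritten in place rather than given a suffix:

```r
library(dplyr)
library(stringr)

# tiny made-up dataset: two people, two encounters each
obs_demo <- tibble(
  personID      = c(1, 1, 2, 2),
  date          = as.Date(c("2020-01-02", "2020-01-01",
                            "2020-01-03", "2020-01-04")),
  symptoms_ever = c("Yes", "No", NA, "Unknown"))

# roll up: one row per person, unique non-missing values pasted in date order
rolled_demo <- obs_demo %>%
  group_by(personID) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(symptoms_ever = paste0(unique(na.omit(symptoms_ever)),
                                   collapse = "; "))

# clean: keep the "highest" value found in each rolled string
clean_demo <- rolled_demo %>%
  mutate(symptoms_ever = case_when(
    str_detect(symptoms_ever, "Yes")     ~ "Yes",
    str_detect(symptoms_ever, "No")      ~ "No",
    str_detect(symptoms_ever, "Unknown") ~ "Unknown",
    TRUE                                 ~ symptoms_ever))
```

The order of the case_when() lines matters: because "Yes" is checked first, a person with both "No" and "Yes" is assigned "Yes". The search is case sensitive, so "No" does not match inside "Unknown".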
Sometimes, you may want to identify “likely” duplicates based on similarity (e.g. string “distance”) across several columns such as name, age, sex, date of birth, etc. You can apply a probabilistic matching algorithm to identify likely duplicates.
See the page on joining datasets for an explanation on this method (LINK). The section on Probabilistic Matching contains an example of applying these algorithms to compare a dataframe to itself, thus performing probabilistic de-duplication.
Much of the information in this page is adapted from these resources and vignettes online:
Load packages
pacman::p_load(
rio,
here,
purrr,
tidyverse
)

Load data
# fake import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

The first 50 rows are displayed:
As an epidemiologist, it is a common need to repeat analyses on sub-groups (e.g. jurisdictions or sub-populations). Iterating with a for loop is one method to automate this process.
For example, let’s say we are making epidemic curves. We can make an epidemic curve (LINK) of all the cases:
# create 'incidence' object
outbreak <- incidence2::incidence(
linelist, # dataframe
date_index = date_onset, # date column
interval = "week", # aggregate counts weekly
groups = gender, # group values by gender
na_as_group = TRUE) # missing gender is own group
## 241 missing observations were removed.
# plot epi curve
plot(outbreak, # name of incidence object
fill = "gender", # color bars by gender
color = "black", # outline color of bars
title = "Outbreak of ALL cases" # title
)

To produce a separate plot for each hospital’s cases, we can put this code within a for loop. The elementary syntax is: for (item in vector) {do something}.
The “item” can also be a number from a numeric vector (e.g. c(1,2,3,4,5...)) so that it can be used with index brackets [[x]] to extract and save. See the subsequent example below.
First, we save a vector of the unique hospital names, hospital_names. The for loop will run once for each of these names (for (hosp in hospital_names) {), and each time the current hospital name will be represented as “hosp” for use within the loop.
Within the loop, filter() is applied to linelist, such that column hospital must equal the current value of hosp.

# make vector of the hospital names
hospital_names <- unique(linelist$hospital)
# for each name ("hosp") in hospital_names, create and print the epi curve
for (hosp in hospital_names) {
# create incidence object specific to the current hospital
outbreak_hosp <- incidence2::incidence(
linelist %>% filter(hospital == hosp), # linelist is filtered to the current hospital
date_index = date_onset,
interval = "week",
groups = gender,
na_as_group = TRUE
)
# Create and save the plot. Title automatically adjusts to the current hospital
plot_hosp <- plot(outbreak_hosp,
fill = "gender",
color = "black",
title = stringr::str_glue("Epidemic of cases admitted to {hosp}")
)
# print the plot for the current hospital
print(plot_hosp)
} # end the for loop when it has been run for every hospital in hospital_names

When “i” is a number:
What is “i”?
Often in loops it is useful to have the iterating “item” be a number - this allows indexing [[x]] to assign or extract. This is often written “i”.
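As a minimal base R sketch of this idea, the loop below uses a numeric index both to extract a value and to save a result. The vector hospital_demo and the list name_lengths are invented for illustration:

```r
# invented vector of names
hospital_demo <- c("Central Hospital", "Port Hospital", "Other")

# create an empty list of the right length to hold the results
name_lengths <- vector("list", length(hospital_demo))

# seq_along() produces the index numbers 1, 2, 3
for (i in seq_along(hospital_demo)) {
  # extract the i-th name, compute something, and save it in the i-th slot
  name_lengths[[i]] <- nchar(hospital_demo[[i]])
}

name_lengths
```

Creating the results list before the loop, rather than growing it on each iteration, is a common habit that keeps longer loops efficient.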
Here is the above for loop, but written so that the vector is numeric c(1,2,3,4,5,6) and the hospital names are extracted from hospital_names with this indexing number (e.g. hospital_names[[2]])
hospital_names <- unique(linelist$hospital)
for (i in seq_along(hospital_names)) {
outbreak_i <- incidence2::incidence(
linelist %>% filter(hospital == hospital_names[[i]]),
date_index = date_onset,
interval = "week",
groups = gender,
na_as_group = TRUE
)
plot_i <- plot(outbreak_i,
fill = "gender",
color = "black",
title = stringr::str_glue("Epidemic of cases admitted to {hospital_names[[i]]}"))
print(plot_i)
}

A loop with many iterations can run for minutes or even hours. Thus, it can be helpful to print the progress to the R console.
Below, code is placed within the loop to print every 100th number.
# loop with code to print progress every 100 iterations
for (row in seq_len(nrow(linelist))) {
  # print progress
  if (row %% 100 == 0) {
    print(row)
  }
}

TO DO - Under construction
To iterate a function over columns in a dataframe:
linelist %>%
select(c(wt_kg, ht_cm, ct_blood, temp, bmi, days_onset_hosp))

The R for Data Science page on iteration
A purrr tutorial
This tab demonstrates the use of gtsummary and dplyr to produce descriptive statistics.
Browse data: get a quick overview of your dataset using the skimr package
Summary statistics: mean, median, range, standard deviations, percentiles
Frequency / cross-tabs: counts and proportions
Statistical tests: t-tests, Wilcoxon rank sum, Kruskal-Wallis and chi-squared tests
Correlations
This code chunk shows the loading of packages required for the analyses.
pacman::p_load(rio, # File import
here, # File locator
skimr, # get overview of data
tidyverse, # data management + ggplot2 graphics,
gtsummary, # summary statistics and tests
corrr # correlation analysis for numeric variables
)
The example dataset used in this section:
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data.
# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")
The first 50 rows of the linelist are displayed below.
## make sure that age variable is numeric
linelist <- linelist %>%
mutate(age = as.numeric(age))
You can use the summary function to get information about variables and data sets.
For a numeric variable it will give you the minimum, median, mean and max as well as the 1st quartile (= 25th percentile) and the 3rd quartile (= 75th percentile)
## get information about a numeric variable
summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    6.00   13.00   16.11   23.00   77.00      88
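As a small illustration of what summary() reports, the sketch below uses a made-up vector of ages (not the linelist) containing one missing value. The object names `ages` and `age_summary` are assumptions made for this example only.

```r
# hypothetical ages (not from the linelist), including one missing value
ages <- c(2, 5, 13, 21, 40, NA)

# summary() returns the minimum, quartiles, median, mean, maximum
# and, when present, the number of missing values
age_summary <- summary(ages)
print(age_summary)

# individual elements can be extracted by name
age_summary[["Median"]]
age_summary[["NA's"]]
```

Extracting elements by name like this can be handy when a single value is needed in later code rather than the printed overview.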
You can also get an overview of each variable in a whole dataset.
## get information about each variable in a dataset
summary(linelist)
## case_id generation date_infection date_onset date_hospitalisation date_outcome outcome
## Length:5888 Min. : 0.0000000000 Min. :2014-03-19 Min. :2014-04-07 Min. :2014-04-17 Min. :2014-04-19 Length:5888
## Class :character 1st Qu.:13.0000000000 1st Qu.:2014-09-06 1st Qu.:2014-09-16 1st Qu.:2014-09-19 1st Qu.:2014-09-26 Class :character
## Mode :character Median :16.0000000000 Median :2014-10-11 Median :2014-10-23 Median :2014-10-23 Median :2014-11-01 Mode :character
## Mean :16.5616508152 Mean :2014-10-22 Mean :2014-11-02 Mean :2014-11-03 Mean :2014-11-12
## 3rd Qu.:20.0000000000 3rd Qu.:2014-12-05 3rd Qu.:2014-12-18 3rd Qu.:2014-12-17 3rd Qu.:2014-12-28
## Max. :37.0000000000 Max. :2015-04-27 Max. :2015-04-30 Max. :2015-04-30 Max. :2015-06-04
## NA's :2087 NA's :241 NA's :936
## gender age age_unit age_years age_cat age_cat5 hospital
## Length:5888 Min. : 0.0000000000 Length:5888 Min. : 0.0000000000 0-4 :1090 0-4 :1090 Length:5888
## Class :character 1st Qu.: 6.0000000000 Class :character 1st Qu.: 6.0000000000 20-29 :1067 5-9 :1060 Class :character
## Mode :character Median :13.0000000000 Mode :character Median :13.0000000000 5-9 :1060 10-14 : 918 Mode :character
## Mean :16.1118965517 Mean :16.0624281609 10-14 : 918 15-19 : 835
## 3rd Qu.:23.0000000000 3rd Qu.:23.0000000000 15-19 : 835 20-24 : 643
## Max. :77.0000000000 Max. :77.0000000000 (Other): 830 (Other):1254
## NA's :88 NA's :88 NA's : 88 NA's : 88
## lon lat infector source wt_kg ht_cm
## Min. :-13.2727552093 Min. :8.44620611741 Length:5888 Length:5888 Min. : -9.0000000000 Min. : 7.0000000
## 1st Qu.:-13.2515303996 1st Qu.:8.46121793209 Class :character Class :character 1st Qu.: 41.0000000000 1st Qu.: 91.0000000
## Median :-13.2290787930 Median :8.46898925059 Mode :character Mode :character Median : 55.0000000000 Median :130.0000000
## Mean :-13.2338063400 Mean :8.46963750878 Mean : 53.0813519022 Mean :125.3828125
## 3rd Qu.:-13.2166314552 3rd Qu.:8.47954113438 3rd Qu.: 66.0000000000 3rd Qu.:159.0000000
## Max. :-13.2052242315 Max. :8.49174756781 Max. :115.0000000000 Max. :292.0000000
##
## ct_blood fever chills cough aches vomit temp
## Min. :16.0000000000 Length:5888 Length:5888 Length:5888 Length:5888 Length:5888 Min. :35.6000000000
## 1st Qu.:20.0000000000 Class :character Class :character Class :character Class :character Class :character 1st Qu.:38.2000000000
## Median :22.0000000000 Mode :character Mode :character Mode :character Mode :character Mode :character Median :38.8000000000
## Mean :21.1971807065 Mean :38.5649826389
## 3rd Qu.:22.0000000000 3rd Qu.:39.3000000000
## Max. :26.0000000000 Max. :40.7000000000
## NA's :128
## time_admission bmi days_onset_hosp
## Length:5888 Min. :-532.5443786980 Min. : 0.00000000000
## Class :character 1st Qu.: 24.5530554814 1st Qu.: 1.00000000000
## Mode :character Median : 32.3140948681 Median : 1.00000000000
## Mean : 47.7271973791 Mean : 2.05631308659
## 3rd Qu.: 49.9795756638 3rd Qu.: 3.00000000000
## Max. :2448.9795918400 Max. :22.00000000000
## NA's :241
skimr package
Using the skimr package you can get a more detailed overview of each of the variables in your dataset.
## get information about each variable in a dataset
skim(linelist)
| Name | linelist |
| Number of rows | 5888 |
| Number of columns | 30 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 4 |
| factor | 2 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| case_id | 0 | 1.00 | 6 | 6 | 0 | 5888 | 0 |
| outcome | 1323 | 0.78 | 5 | 7 | 0 | 2 | 0 |
| gender | 286 | 0.95 | 1 | 1 | 0 | 2 | 0 |
| age_unit | 0 | 1.00 | 5 | 6 | 0 | 2 | 0 |
| hospital | 0 | 1.00 | 5 | 36 | 0 | 6 | 0 |
| infector | 2088 | 0.65 | 6 | 6 | 0 | 2697 | 0 |
| source | 2088 | 0.65 | 5 | 7 | 0 | 2 | 0 |
| fever | 235 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| chills | 235 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| cough | 235 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| aches | 235 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| vomit | 235 | 0.96 | 2 | 3 | 0 | 2 | 0 |
| time_admission | 743 | 0.87 | 5 | 5 | 0 | 1072 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date_infection | 2087 | 0.65 | 2014-03-19 | 2015-04-27 | 2014-10-11 | 359 |
| date_onset | 241 | 0.96 | 2014-04-07 | 2015-04-30 | 2014-10-23 | 367 |
| date_hospitalisation | 0 | 1.00 | 2014-04-17 | 2015-04-30 | 2014-10-23 | 363 |
| date_outcome | 936 | 0.84 | 2014-04-19 | 2015-06-04 | 2014-11-01 | 371 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| age_cat | 88 | 0.99 | FALSE | 8 | 0-4: 1090, 20-: 1067, 5-9: 1060, 10-: 918 |
| age_cat5 | 88 | 0.99 | FALSE | 16 | 0-4: 1090, 5-9: 1060, 10-: 918, 15-: 835 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| generation | 0 | 1.00 | 16.56 | 5.79 | 0.00 | 13.00 | 16.00 | 20.00 | 37.00 | ▁▆▇▂▁ |
| age | 88 | 0.99 | 16.11 | 12.56 | 0.00 | 6.00 | 13.00 | 23.00 | 77.00 | ▇▅▂▁▁ |
| age_years | 88 | 0.99 | 16.06 | 12.58 | 0.00 | 6.00 | 13.00 | 23.00 | 77.00 | ▇▅▂▁▁ |
| lon | 0 | 1.00 | -13.23 | 0.02 | -13.27 | -13.25 | -13.23 | -13.22 | -13.21 | ▅▃▃▆▇ |
| lat | 0 | 1.00 | 8.47 | 0.01 | 8.45 | 8.46 | 8.47 | 8.48 | 8.49 | ▅▇▇▇▆ |
| wt_kg | 0 | 1.00 | 53.08 | 18.64 | -9.00 | 41.00 | 55.00 | 66.00 | 115.00 | ▁▃▇▅▁ |
| ht_cm | 0 | 1.00 | 125.38 | 49.62 | 7.00 | 91.00 | 130.00 | 159.00 | 292.00 | ▂▅▇▂▁ |
| ct_blood | 0 | 1.00 | 21.20 | 1.70 | 16.00 | 20.00 | 22.00 | 22.00 | 26.00 | ▂▃▇▃▁ |
| temp | 128 | 0.98 | 38.56 | 0.98 | 35.60 | 38.20 | 38.80 | 39.30 | 40.70 | ▁▂▃▇▁ |
| bmi | 0 | 1.00 | 47.73 | 59.66 | -532.54 | 24.55 | 32.31 | 49.98 | 2448.98 | ▇▂▁▁▁ |
| days_onset_hosp | 241 | 0.96 | 2.06 | 2.25 | 0.00 | 1.00 | 1.00 | 3.00 | 22.00 | ▇▁▁▁▁ |
gtsummary package
Using gtsummary you can create a table with different summary statistics, for example mean, median, range, standard deviation and percentiles. You can also show these all in one table.
Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with mean
tbl_summary(statistic = age ~ "{mean}")
| Characteristic | N = 5,888¹ |
|---|---|
| age | 16 |
| Unknown | 88 |
¹ Mean
Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with median
tbl_summary(statistic = age ~ "{median}")
| Characteristic | N = 5,888¹ |
|---|---|
| age | 13 |
| Unknown | 88 |
¹ Median
The range here is the minimum and maximum values for the variable (see percentiles for the interquartile range). Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with range
tbl_summary(statistic = age ~ "{min}, {max}")
| Characteristic | N = 5,888¹ |
|---|---|
| age | 0, 77 |
| Unknown | 88 |
¹ Range
Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with standard deviation
tbl_summary(statistic = age ~ "{sd}")
| Characteristic | N = 5,888¹ |
|---|---|
| age | 13 |
| Unknown | 88 |
¹ SD
To return percentiles you can type in one value that you would like, or you can type in multiple (e.g. to return the interquartile range).
Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with interquartile range
tbl_summary(statistic = age ~ "{p25}, {p75}")
| Characteristic | N = 5,888¹ |
|---|---|
| age | 6, 23 |
| Unknown | 88 |
¹ IQR
You can combine all of the previously shown elements in one table by choosing which statistics you want to show. To do this, set the type to “continuous2”, which tells the function to show each statistic on its own row.
Note that tbl_summary automatically excludes missing values from the calculation (otherwise the returned value would be NA). The number of missing values is shown in the Unknown row.
linelist %>%
## only keep variable of interest
select(age) %>%
## create summary table with multiple statistics
tbl_summary(
## tell the function you want to get multiple statistics back
type = age ~ "continuous2",
## define which statistics you want to get back
statistic = age ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}")
)
| Characteristic | N = 5,888 |
|---|---|
| age | |
| Mean (SD) | 16 (13) |
| Median (IQR) | 13 (6, 23) |
| Range | 0, 77 |
| Unknown | 88 |
dplyr package
You can also use dplyr to create a table with different summary statistics, for example mean, median, range, standard deviation and percentiles. You can also show these all in one table. The difference with dplyr is that the output is not automatically formatted as nicely as with gtsummary.
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).
linelist %>%
## get the mean value of age while excluding missings
summarise(mean = mean(age, na.rm = TRUE))
## mean
## 1 16.11189655172414
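To make the effect of na.rm concrete, here is a minimal base R sketch on a made-up vector (the values and object names are assumptions for illustration, not taken from the linelist). It shows that a single missing value turns the mean into NA unless na.rm = TRUE is set.

```r
# hypothetical ages with one missing value
ages <- c(4, 10, NA, 22)

# without na.rm the missing value propagates and the result is NA
mean_with_na <- mean(ages)
mean_with_na

# with na.rm = TRUE the missing value is dropped before calculating
mean_without_na <- mean(ages, na.rm = TRUE)
mean_without_na

# it is often worth reporting how many values were dropped
n_missing <- sum(is.na(ages))
n_missing
```

Reporting the number of dropped values alongside the statistic mirrors what the Unknown row does in the gtsummary tables.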
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).
linelist %>%
## get the median value of age while excluding missings
summarise(median = median(age, na.rm = TRUE))
## median
## 1 13
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).
linelist %>%
## get the range value of age while excluding missings
summarise(range = range(age, na.rm = TRUE))
## range
## 1 0
## 2 77
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, the returned value will be NA (missing).
linelist %>%
## get the standard deviation of age while excluding missings
summarise(sd = sd(age, na.rm = TRUE))
## sd
## 1 12.56378366437255
To return percentiles you can type in one value that you would like, or you can type in multiple (e.g. to return the interquartile range).
Note the argument na.rm = TRUE, which removes missing values from the calculation.
If missing values are not excluded, quantile() stops with an error rather than returning NA.
linelist %>%
## get the default percentile values of age while excluding missings
## these are 0%, 25%, 50%, 75%, 100%
summarise(percentiles = quantile(age, na.rm = TRUE))
## percentiles
## 1 0
## 2 6
## 3 13
## 4 23
## 5 77
linelist %>%
## get specified percentile values of age while excluding missings
## these are 5%, 50%, 75%, 98%
summarise(percentiles = quantile(age,
probs = c(0.05, 0.5, 0.75, 0.98),
na.rm = TRUE))
## percentiles
## 1 1
## 2 13
## 3 23
## 4 49
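The probs argument can be tried out on a small made-up vector in base R. This is only a sketch: the values and object names are assumptions for illustration. It also shows that quantile(), unlike mean(), raises an error when missing values are present and na.rm is not set.

```r
# hypothetical ages with one missing value
ages <- c(1, 3, 5, 7, 9, NA)

# the 25th and 75th percentiles (the bounds of the interquartile range)
iqr_bounds <- quantile(ages, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_bounds

# collapse the two bounds into a single text value, separated by a comma
iqr_text <- paste(iqr_bounds, collapse = ", ")
iqr_text

# without na.rm = TRUE, quantile() fails instead of returning NA
quantile_fails <- inherits(try(quantile(ages), silent = TRUE), "try-error")
quantile_fails
```

The result is a named vector, so the percentile labels ("25%", "75%") travel with the values until they are collapsed into text.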
You can combine all of the previously shown elements in one table by choosing
which statistics you want to show. In dplyr you will need to use the str_c
function from stringr to combine outputs for the IQR and the range into one
cell, separated by a comma.
Note the argument na.rm = TRUE, which removes missing values from the calculation. Unlike gtsummary, dplyr does not report the number of missing values automatically.
linelist %>%
summarise(
## get the mean
mean = mean(age, na.rm = TRUE),
## get the standard deviation
SD = sd(age, na.rm = TRUE),
## get the median
median = median(age, na.rm = TRUE),
## collapse the IQR separated by a comma
IQR = str_c(
quantile(age, probs = c(0.25, 0.75), na.rm = TRUE),
collapse = ", "
),
## collapse the range separated by a comma
Range = str_c(
range(age, na.rm = TRUE),
collapse = ", "
)
)
## mean SD median IQR Range
## 1 16.11189655172414 12.56378366437255 13 6, 23 0, 77
gtsummary package
TODO: Note that percentages are calculated without missings
Using gtsummary you can create a table with different counts and proportions
for variables with two or more categories, as well as grouping by another variable.
To produce the counts of a single variable we can use the tbl_summary function.
Note that here, the fever variable is yes/no (dichotomous) and tbl_summary
automatically only presents the “yes” row.
To show all levels you could use the type argument to choose categorical,
e.g. tbl_summary(type = fever ~ "categorical").
linelist %>%
## only keep the variable of interest
select(fever) %>%
## produce summary table
tbl_summary()
| Characteristic | N = 5,888¹ |
|---|---|
| fever | 4,561 (81%) |
| Unknown | 235 |
¹ n (%)
You can also show multiple variables below each other simply by adding them to
select.
linelist %>%
## only keep the variables of interest
select(fever, gender) %>%
## produce summary table
tbl_summary()
| Characteristic | N = 5,888¹ |
|---|---|
| fever | 4,561 (81%) |
| Unknown | 235 |
| gender | |
| f | 2,811 (50%) |
| m | 2,791 (50%) |
| Unknown | 286 |
¹ n (%)
There are two options to produce a two-by-two table (i.e. comparing two variables).
One option is to use tbl_cross, however this function only accepts two variables
at once. The option below with tbl_summary allows more variables.
linelist %>%
## only keep the variables of interest
select(fever, outcome, gender) %>%
## produce summary table stratified by gender
tbl_summary(by = gender) %>%
## add a column for the totals
add_overall()
## 286 observations missing `gender` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `gender` column before passing to `tbl_summary()`.
| Characteristic | Overall, N = 5,602¹ | f, N = 2,811¹ | m, N = 2,791¹ |
|---|---|---|---|
| fever | 4,346 (81%) | 2,174 (81%) | 2,172 (81%) |
| Unknown | 220 | 113 | 107 |
| outcome | | | |
| Death | 2,447 (56%) | 1,236 (57%) | 1,211 (56%) |
| Recover | 1,891 (44%) | 944 (43%) | 947 (44%) |
| Unknown | 1,264 | 631 | 633 |
¹ n (%)
Producing counts based on three variables (adding a stratifier).
## TODO: add stratified tables when available
# table_3vars <- table(linelist$fever, linelist$gender, linelist$outcome)
#
# ftable(table_3vars)
dplyr package
Creating cross tabulations with dplyr is less straightforward, as this does not
fit within the tidyverse dataset structure. It is still useful to demonstrate,
though, as the data produced can be used for plotting (see the ggplot section).
Another option is to use the tabyl function from the janitor package.
Producing counts and proportions for a single variable. To see how to do this for multiple variables, see the for-loop section.
linelist %>%
## count the variable of interest
count(fever) %>%
## calculate proportion
mutate(percentage = n / sum(n) * 100)
## fever n percentage
## 1 no 1092 18.54619565217391
## 2 yes 4561 77.46263586956522
## 3 <NA> 235 3.99116847826087
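The percentage for "yes" here (about 77%) differs from the 81% shown in the gtsummary table because the denominators differ: count() keeps the missing values as their own row, whereas tbl_summary leaves them out of the percentage. The base R sketch below reproduces both figures from the counts printed above; the object names are assumptions made for this example.

```r
# counts of fever as printed above (235 records have a missing value)
fever_counts <- c(no = 1092, yes = 4561, missing = 235)

# denominator includes records with missing fever (as in the dplyr output)
pct_incl_missing <- fever_counts / sum(fever_counts) * 100
round(pct_incl_missing, 1)

# denominator excludes records with missing fever (as in the gtsummary table)
known <- fever_counts[c("no", "yes")]
pct_excl_missing <- known / sum(known) * 100
round(pct_excl_missing, 1)
```

Which denominator is appropriate depends on the question being asked, so it is worth stating the choice explicitly when reporting percentages.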
Producing counts and proportions based on a grouping variable. Here we use the
dplyr group_by function; for more information see the
grouping and aggregating section.
You can calculate the percentages of the total by using ungroup() after count(...).
Note that it is possible to change the table below to wide format, making it
more like a two-by-two (cross tabulation), using the tidyr pivot_wider function.
This would be done by adding this to the end of the code below:
pivot_wider(names_from = gender, values_from = c(n, percentage))
For more information see the pivot section.
linelist %>%
## do everything by gender
group_by(gender) %>%
## count the variable of interest
count(fever) %>%
## calculate proportion
## note that the denominator here is the sum of each gender
mutate(percentage = n / sum(n) * 100)
## # A tibble: 9 x 4
## # Groups: gender [3]
## gender fever n percentage
## <chr> <chr> <int> <dbl>
## 1 f no 524 18.6
## 2 f yes 2174 77.3
## 3 f <NA> 113 4.02
## 4 m no 512 18.3
## 5 m yes 2172 77.8
## 6 m <NA> 107 3.83
## 7 <NA> no 56 19.6
## 8 <NA> yes 215 75.2
## 9 <NA> <NA> 15 5.24
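The difference between percentages within each group and percentages of the total can also be sketched in base R with prop.table(). This is an alternative illustration, not the dplyr approach used above. It uses the counts printed above for the records where both gender and fever are known; the object names are assumptions.

```r
# counts from the output above, restricted to known gender and known fever
counts <- matrix(
  c(524, 2174,
    512, 2172),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("f", "m"), fever = c("no", "yes"))
)

# percentages within each gender: every row sums to 100
pct_within_gender <- prop.table(counts, margin = 1) * 100
round(pct_within_gender, 1)

# percentages of the total: the whole table sums to 100
pct_of_total <- prop.table(counts) * 100
round(pct_of_total, 1)
```

Because missing values were left out of this small table, the within-gender percentages will not match the grouped dplyr output exactly, which keeps the missing rows in the denominator.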
Producing counts based on three variables (adding a stratifier).
linelist %>%
## do everything by gender and outcome
group_by(gender, outcome) %>%
## count the variable of interest
count(fever) %>%
## calculate the proportion
## note that the denominator here is the sum of each group combination
mutate(percentage = n / sum(n) * 100)
## # A tibble: 27 x 5
## # Groups: gender, outcome [9]
## gender outcome fever n percentage
## <chr> <chr> <chr> <int> <dbl>
## 1 f Death no 243 19.7
## 2 f Death yes 940 76.1
## 3 f Death <NA> 53 4.29
## 4 f Recover no 155 16.4
## 5 f Recover yes 750 79.4
## 6 f Recover <NA> 39 4.13
## 7 f <NA> no 126 20.0
## 8 f <NA> yes 484 76.7
## 9 f <NA> <NA> 21 3.33
## 10 m Death no 222 18.3
## # ... with 17 more rows
gtsummary package
Performing statistical tests of comparison with tbl_summary is done by using
add_p function and specifying which test to use.
It is possible to get p-values corrected for multiple testing by using the
add_q function.
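The idea behind correcting p-values for multiple testing can be sketched with the base R function p.adjust(). The p-values below are made up for illustration, and the two methods shown are only examples; check the add_q documentation for the method it applies by default.

```r
# hypothetical p-values from five separate tests
p_values <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# Bonferroni correction: each p-value is multiplied by the number of tests
p_bonferroni <- p.adjust(p_values, method = "bonferroni")
p_bonferroni

# Benjamini-Hochberg correction: controls the false discovery rate
p_bh <- p.adjust(p_values, method = "BH")
p_bh
```

Adjusted values are never smaller than the original p-values, so a result that looked significant on its own may no longer be so after correction.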
Compare the difference in means for a continuous variable in two groups. For example compare the mean age by patient outcome.
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## produce summary table
tbl_summary(
## specify which statistic to show
statistic = age ~ "{mean} ({sd})",
## specify the grouping variable
by = outcome) %>%
## specify which test to perform
add_p(age ~ "t.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
| Characteristic | Death, N = 2,582¹ | Recover, N = 1,983¹ | p-value² |
|---|---|---|---|
| age | 17 (13) | 16 (13) | 0.032 |
| Unknown | 48 | 23 | |
¹ Mean (SD)
² Welch Two Sample t-test
Compare the distribution of a continuous variable in two groups. When comparing two groups, the default is to use the Wilcoxon rank sum test and to show the median (IQR). To compare more than two groups, the Kruskal-Wallis test is more appropriate.
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## produce summary table
tbl_summary(
## specify which statistic to show (default so could remove)
statistic = age ~ "{median} ({p25}, {p75})",
## specify the grouping variable
by = outcome) %>%
## specify which test to perform (default so could leave brackets empty)
add_p(age ~ "wilcox.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
| Characteristic | Death, N = 2,582¹ | Recover, N = 1,983¹ | p-value² |
|---|---|---|---|
| age | 14 (7, 24) | 13 (6, 22) | 0.009 |
| Unknown | 48 | 23 | |
¹ Median (IQR)
² Wilcoxon rank sum test
Compare the distribution of a continuous variable in two or more groups, regardless of whether the data are normally distributed.
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## produce summary table
tbl_summary(
## specify which statistic to show (default so could remove)
statistic = age ~ "{median} ({p25}, {p75})",
## specify the grouping variable
by = outcome) %>%
## specify which test to perform
add_p(age ~ "kruskal.test")
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
| Characteristic | Death, N = 2,582¹ | Recover, N = 1,983¹ | p-value² |
|---|---|---|---|
| age | 14 (7, 24) | 13 (6, 22) | 0.009 |
| Unknown | 48 | 23 | |
¹ Median (IQR)
² Kruskal-Wallis rank sum test
Compare the proportions of a categorical variable in two groups. The default is to perform a chi-squared test of independence with continuity correction, but if any expected cell count is below 5 then a Fisher’s exact test is used.
linelist %>%
## only keep variables of interest
select(gender, outcome) %>%
## produce summary table
tbl_summary(
## specify the grouping variable
by = outcome
) %>%
## specify which test to perform (left empty to use the default)
add_p()
## 1323 observations missing `outcome` have been removed. To include these observations, use `forcats::fct_explicit_na()` on `outcome` column before passing to `tbl_summary()`.
| Characteristic | Death, N = 2,582¹ | Recover, N = 1,983¹ | p-value² |
|---|---|---|---|
| gender | | | 0.7 |
| f | 1,236 (51%) | 944 (50%) | |
| m | 1,211 (49%) | 947 (50%) | |
| Unknown | 135 | 92 | |
¹ n (%)
² Pearson's Chi-squared test
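The rule of thumb about expected cell counts can be checked by hand in base R. The sketch below uses the gender by outcome counts from the table above and falls back to Fisher's exact test only if any expected count is below 5. It illustrates the rule and is not a description of how add_p is implemented; the object names are assumptions.

```r
# counts from the table above: gender (rows) by outcome (columns)
gender_outcome <- matrix(
  c(1236, 944,
    1211, 947),
  nrow = 2, byrow = TRUE,
  dimnames = list(gender = c("f", "m"), outcome = c("Death", "Recover"))
)

# run the chi-squared test and look at the expected cell counts
chisq_res <- chisq.test(gender_outcome)
expected_counts <- chisq_res$expected
round(expected_counts, 1)

# choose the p-value according to the smallest expected count
if (all(expected_counts >= 5)) {
  p_value <- chisq_res$p.value
} else {
  p_value <- fisher.test(gender_outcome)$p.value
}
p_value
```

With counts this large every expected cell is far above 5, so the chi-squared result is the one kept.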
dplyr package
Performing statistical tests in dplyr alone is very dense, again because it
does not fit within the tidy-data framework. It requires using purrr to create
a list of dataframes for each of the subgroups you want to compare.
An easier alternative may be the rstatix package.
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## drop those missing outcome
filter(!is.na(outcome)) %>%
## specify the grouping variable
group_by(outcome) %>%
## create a subset of data for each group (as a list)
nest() %>%
## spread in to wide format
pivot_wider(names_from = outcome, values_from = data) %>%
mutate(
## calculate the mean age for the death group
Death_mean = map(Death, ~mean(.x$age, na.rm = TRUE)),
## calculate the sd among dead
Death_sd = map(Death, ~sd(.x$age, na.rm = TRUE)),
## calculate the mean age for the recover group
Recover_mean = map(Recover, ~mean(.x$age, na.rm = TRUE)),
## calculate the sd among recovered
Recover_sd = map(Recover, ~sd(.x$age, na.rm = TRUE)),
## using both grouped data sets compare mean age with a t-test
## keep only the p.value
t_test = map2(Death, Recover, ~t.test(.x$age, .y$age)$p.value)
) %>%
## drop datasets
select(-Death, -Recover) %>%
## return a dataset with the means, standard deviations and p.value
unnest(cols = everything())
## # A tibble: 1 x 5
## Death_mean Death_sd Recover_mean Recover_sd t_test
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16.6 12.6 15.8 12.6 0.0321
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## drop those missing outcome
filter(!is.na(outcome)) %>%
## specify the grouping variable
group_by(outcome) %>%
## create a subset of data for each group (as a list)
nest() %>%
## spread in to wide format
pivot_wider(names_from = outcome, values_from = data) %>%
mutate(
## calculate the median age for the death group
Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
## calculate the IQR among dead
Death_iqr = map(Death, ~str_c(
quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE),
collapse = ", "
)),
## calculate the median age for the recover group
Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)),
## calculate the IQR among recovered
Recover_iqr = map(Recover, ~str_c(
quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE),
collapse = ", "
)),
## using both grouped data sets compare age distribution with a wilcox test
## keep only the p.value
wilcox = map2(Death, Recover, ~wilcox.test(.x$age, .y$age)$p.value)
) %>%
## drop datasets
select(-Death, -Recover) %>%
## return a dataset with the medians, IQRs and p.value
unnest(cols = everything())
## # A tibble: 1 x 5
## Death_median Death_iqr Recover_median Recover_iqr wilcox
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 14 7, 24 13 6, 22 0.00878
linelist %>%
## only keep variables of interest
select(age, outcome) %>%
## drop those missing outcome
filter(!is.na(outcome)) %>%
## specify the grouping variable
group_by(outcome) %>%
## create a subset of data for each group (as a list)
nest() %>%
## spread in to wide format
pivot_wider(names_from = outcome, values_from = data) %>%
mutate(
## calculate the median age for the death group
Death_median = map(Death, ~median(.x$age, na.rm = TRUE)),
## calculate the IQR among dead
Death_iqr = map(Death, ~str_c(
quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE),
collapse = ", "
)),
## calculate the median age for the recover group
Recover_median = map(Recover, ~median(.x$age, na.rm = TRUE)),
## calculate the IQR among recovered
Recover_iqr = map(Recover, ~str_c(
quantile(.x$age, probs = c(0.25, 0.75), na.rm = TRUE),
collapse = ", "
)),
## using the original data set compare age distribution with a kruskal test
## keep only the p.value
kruskal = kruskal.test(linelist$age, linelist$outcome)$p.value
) %>%
## drop datasets
select(-Death, -Recover) %>%
## return a dataset with the medians, IQRs and p.value
unnest(cols = everything())
## # A tibble: 1 x 5
## Death_median Death_iqr Recover_median Recover_iqr kruskal
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 14 7, 24 13 6, 22 0.00878
linelist %>%
## do everything by outcome
group_by(outcome) %>%
## count the variable of interest
count(gender) %>%
## calculate proportion
## note that the denominator here is the sum of each outcome group
mutate(percentage = n / sum(n) * 100) %>%
pivot_wider(names_from = outcome, values_from = c(n, percentage)) %>%
filter(!is.na(gender)) %>%
mutate(pval = chisq.test(linelist$gender, linelist$outcome)$p.value)
## # A tibble: 2 x 8
## gender n_Death n_Recover n_NA percentage_Death percentage_Recover percentage_NA pval
## <chr> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 f 1236 944 631 47.9 47.6 47.7 0.723
## 2 m 1211 947 633 46.9 47.8 47.8 0.723
base package
You can also just use the base functions to produce the results of statistical
tests. The outputs of these are however usually lists, and so are harder to
manipulate.
## compare mean age by outcome group with a t-test
t.test(age ~ outcome, data = linelist)
##
## Welch Two Sample t-test
##
## data: age by outcome
## t = 2.1441049017989, df = 4216.0393810472, p-value = 0.03208150080434
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.06967852801116904 1.55794578138109108
## sample estimates:
## mean in group Death mean in group Recover
## 16.58524072612470 15.77142857142857
## compare age distribution by outcome group with a wilcox test
wilcox.test(age ~ outcome, data = linelist)
##
## Wilcoxon rank sum test with continuity correction
##
## data: age by outcome
## W = 2596302.5, p-value = 0.008777554791062
## alternative hypothesis: true location shift is not equal to 0
## compare age distribution by outcome group with a kruskal-wallis test
kruskal.test(age ~ outcome, linelist)
##
## Kruskal-Wallis rank sum test
##
## data: age by outcome
## Kruskal-Wallis chi-squared = 6.8675977150979, df = 1, p-value = 0.008777256233551
## compare the proportions in each group with a chi-squared test
chisq.test(linelist$gender, linelist$outcome)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: linelist$gender and linelist$outcome
## X-squared = 0.12593664271323, df = 1, p-value = 0.7226828401587
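Because the base test functions return list-like objects, individual results can be pulled out by name. The sketch below uses the sleep dataset that ships with R rather than the linelist, and the layout of the one-row data frame is only an assumption about what might be convenient.

```r
# Welch two-sample t-test on the built-in sleep data
test_res <- t.test(extra ~ group, data = sleep)

# the result is an object of class "htest", which is a named list
class(test_res)
names(test_res)

# pull out individual elements by name
test_res$p.value
test_res$conf.int

# gather the pieces of interest into a one-row data frame
tidy_res <- data.frame(
  statistic = unname(test_res$statistic),
  df        = unname(test_res$parameter),
  p_value   = test_res$p.value
)
tidy_res
```

Collecting results this way by hand is what packages such as broom automate.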
Correlation between numeric variables can be investigated using the corrr
package. It allows you to compute correlations using Pearson, Kendall
tau or Spearman rho. The package creates a table and also has a function to
automatically plot the values.
correlation_tab <- linelist %>%
## pick the numeric variables of interest
select(generation, age, ct_blood, days_onset_hosp, wt_kg, ht_cm) %>%
## create correlation table (using default pearson)
correlate()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
## remove duplicate entries (the table is mirrored)
correlation_tab <- correlation_tab %>%
shave()
## view correlation table
correlation_tab
## # A tibble: 6 x 7
## term generation age ct_blood days_onset_hosp wt_kg ht_cm
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 generation NA NA NA NA NA NA
## 2 age -0.0104 NA NA NA NA NA
## 3 ct_blood 0.179 -0.00120 NA NA NA NA
## 4 days_onset_hosp -0.285 -0.00426 -0.601 NA NA NA
## 5 wt_kg 0.00334 0.836 0.00196 -0.0164 NA NA
## 6 ht_cm 0.00394 0.875 0.00630 -0.0115 0.883 NA
## plot correlations
rplot(correlation_tab)
## Don't know how to automatically pick scale for object of type noquote. Defaulting to continuous.
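For comparison, a correlation matrix can also be computed with the base R function cor(), which offers the same choice of methods through its method argument. The sketch below uses the built-in mtcars data and an arbitrary choice of three columns, so it is only an illustration of the arguments and not a replacement for the corrr workflow above.

```r
# three numeric columns from the built-in mtcars data
num_vars <- mtcars[, c("mpg", "wt", "hp")]

# Pearson correlation (the default), using all complete pairs of observations
cor_pearson <- cor(num_vars, use = "pairwise.complete.obs")
round(cor_pearson, 2)

# Spearman rank correlation
cor_spearman <- cor(num_vars, method = "spearman", use = "pairwise.complete.obs")
round(cor_spearman, 2)
```

Unlike the shaved corrr table, the base matrix is mirrored across the diagonal, so each pair of variables appears twice.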
Much of the information in this page is adapted from these resources and vignettes online:
This tab demonstrates the use of gtsummary and regression packages to look at associations between variables (e.g. odds ratios, risk ratios and hazard ratios).
This code chunk shows the loading of packages required for the analyses.
pacman::p_load(rio, # File import
here, # File locator
tidyverse, # data management + ggplot2 graphics,
stringr, # manipulate text strings
purrr, # loop over objects in a tidy way
gtsummary, # summary statistics and tests
broom, # tidy up results from regressions
parameters, # alternative to tidy up results from regressions
see # visualisation companion to easystats packages such as parameters
)
The example dataset used in this section:
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data.
# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")
The first 50 rows of the linelist are displayed below.
## make sure that age variable is numeric
linelist <- linelist %>%
mutate(age = as.numeric(age))
## define variables of interest
explanatory_vars <- c("gender", "fever", "chills", "cough", "aches", "vomit")
## make dichotomous variables in to 0/1
linelist <- linelist %>%
mutate(
## for each of the variables listed
across(
all_of(c(explanatory_vars, "outcome")),
## recode male, yes and death to 1; female, no and recover to 0
## otherwise set to missing
~case_when(
. %in% c("m", "yes", "Death") ~ 1,
. %in% c("f", "no", "Recover") ~ 0,
TRUE ~ NA_real_
))
)
## add in age_category to the explanatory vars
explanatory_vars <- c(explanatory_vars, "age_cat")
## drop rows with missing information for variables of interest
linelist <- linelist %>%
drop_na(any_of(c("outcome", explanatory_vars)))
There are two options for doing univariate analysis.
You can use the gtsummary package, or you can use the individual regression
functions available in base R together with the broom package.
gtsummary package
univ_tab <- linelist %>%
## select variables of interest
dplyr::select(all_of(explanatory_vars), outcome) %>%
## produce univariate table
tbl_uvregression(
## define regression want to run (generalised linear model)
method = glm,
## define outcome variable
y = outcome,
## define what type of glm want to run (logistic)
method.args = list(family = binomial),
## exponentiate the outputs to produce odds ratios (rather than log odds)
exponentiate = TRUE
)
## view univariate results table
univ_tab
| Characteristic | N | OR | 95% CI | p-value |
|---|---|---|---|---|
| gender | 4,167 | 0.98 | 0.86, 1.10 | 0.7 |
| fever | 4,167 | 0.87 | 0.75, 1.02 | 0.094 |
| chills | 4,167 | 1.11 | 0.96, 1.30 | 0.2 |
| cough | 4,167 | 0.93 | 0.79, 1.11 | 0.4 |
| aches | 4,167 | 0.94 | 0.77, 1.15 | 0.5 |
| vomit | 4,167 | 1.05 | 0.93, 1.19 | 0.4 |
| age_cat | 4,167 | | | |
| 0-4 | | — | — | |
| 5-9 | | 1.01 | 0.82, 1.23 | >0.9 |
| 10-14 | | 1.09 | 0.88, 1.35 | 0.4 |
| 15-19 | | 1.00 | 0.81, 1.23 | >0.9 |
| 20-29 | | 1.43 | 1.17, 1.76 | <0.001 |
| 30-49 | | 1.05 | 0.84, 1.31 | 0.7 |
| 50-69 | | 1.14 | 0.71, 1.87 | 0.6 |
| 70+ | | 0.84 | 0.16, 4.59 | 0.8 |
OR = Odds Ratio, CI = Confidence Interval
base
Using the glm function from the stats package (part of base R), you can
produce odds ratios.
For a single exposure variable, pass its name to glm in a formula and then use tidy from
the broom package to get the exponentiated odds ratio estimates and confidence
intervals. Here we demonstrate how to combine model outputs with a table of
counts.
model <- glm(
## define the variables of interest
outcome ~ age_cat,
## define the type of regression (logistic)
family = "binomial",
## define your dataset
data = linelist) %>%
## clean up the outputs of the regression (exponentiate and produce CIs)
tidy(
exponentiate = TRUE,
conf.int = TRUE)
linelist %>%
## get counts of variable of interest grouped by outcome
group_by(outcome) %>%
count(age_cat) %>%
## spread to wide format (as in cross-tabulation)
pivot_wider(names_from = outcome, values_from = n) %>%
## drop rows with missings
filter(!is.na(age_cat)) %>%
## merge with the outputs of the regression
bind_cols(., model) %>%
## only keep columns interested in
select(term, 2:3, estimate, conf.low, conf.high, p.value)
## # A tibble: 8 x 7
## term `0` `1` estimate conf.low conf.high p.value
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 359 425 1.18 1.03 1.36 0.0186
## 2 age_cat5-9 353 420 1.01 0.823 1.23 0.961
## 3 age_cat10-14 270 349 1.09 0.883 1.35 0.417
## 4 age_cat15-19 285 336 0.996 0.806 1.23 0.969
## 5 age_cat20-29 284 482 1.43 1.17 1.76 0.000508
## 6 age_cat30-49 234 291 1.05 0.841 1.31 0.664
## 7 age_cat50-69 31 42 1.14 0.707 1.87 0.586
## 8 age_cat70+ 3 3 0.845 0.156 4.59 0.837
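To show what the exponentiation step above is doing, the following sketch computes odds ratios by hand from a logistic regression in base R. The data are simulated and the object names (`toy`, `fit`, `or_table`) are illustrative only. It uses Wald intervals from `confint.default()`; the intervals reported by `tidy()` may be computed differently, so the bounds need not match exactly.

```r
# Simulated data: one binary exposure and one binary outcome (illustrative only)
set.seed(42)
n <- 500
exposed <- rbinom(n, 1, 0.5)
outcome <- rbinom(n, 1, plogis(-0.5 + 0.7 * exposed))
toy <- data.frame(outcome, exposed)

# Logistic regression: coefficients are on the log-odds scale
fit <- glm(outcome ~ exposed, family = binomial, data = toy)

log_or <- coef(fit)             # log odds ratios
ci_log <- confint.default(fit)  # Wald confidence intervals, log-odds scale

# Exponentiate estimates and interval bounds to get odds ratios
or_table <- data.frame(
  term      = names(log_or),
  estimate  = exp(log_or),
  conf.low  = exp(ci_log[, 1]),
  conf.high = exp(ci_log[, 2]),
  row.names = NULL
)
or_table

# Cross-check: with a single binary exposure, the model odds ratio equals
# the cross-product ratio from the 2x2 table
tab <- table(toy$exposed, toy$outcome)
or_2x2 <- (tab["1", "1"] * tab["0", "0"]) / (tab["1", "0"] * tab["0", "1"])
or_2x2
```

The intercept row is the odds of the outcome in the reference group, not an odds ratio, which is why it is usually dropped before presenting results.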
To run over several exposure variables to produce univariate odds ratios (i.e.
not controlling for each other), you can pass a vector of variable names to the
map function in the purrr package. This will loop over each of the variables
running regressions for each one.
models <- explanatory_vars %>%
## combine each name of the variables of interest with the name of outcome variable
str_c("outcome ~ ", .) %>%
## for each string above ("outcome ~ variable of interest")
map(
## run a generalised linear model
~glm(
## define formula as each of the strings above
as.formula(.x),
## define type of glm (logistic)
family = "binomial",
## define your dataset
data = linelist)
) %>%
## for each of the output regressions from above
map(
## tidy the output
~tidy(
## each of the regressions
.x,
## exponentiate and produce CIs
exponentiate = TRUE,
conf.int = TRUE)
) %>%
## collapse the list of regressions outputs in to one data frame
bind_rows()
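## The loop above builds each formula by pasting strings together. A possible
## alternative, sketched here with base R only, is to build the formula with
## reformulate() and loop with lapply(). The data frame `toy` and the vector
## `vars` are made up for illustration and are not objects from this page.

```r
# Simulated binary outcome and two binary exposures (illustrative only)
set.seed(7)
toy <- data.frame(
  outcome = rbinom(200, 1, 0.4),
  fever   = rbinom(200, 1, 0.6),
  cough   = rbinom(200, 1, 0.3)
)
vars <- c("fever", "cough")

# One univariate logistic regression per explanatory variable
univ_models <- lapply(vars, function(v) {
  f   <- reformulate(v, response = "outcome")  # builds outcome ~ <v>
  fit <- glm(f, family = binomial, data = toy)
  data.frame(term = v, estimate = exp(coef(fit)[[v]]))  # odds ratio for v
})

# Collapse the list of one-row data frames into a single table
univ_or <- do.call(rbind, univ_models)
univ_or
```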
## for each explanatory variable
univ_tab_base <- map(explanatory_vars,
~{linelist %>%
## group data set by outcome
group_by(outcome) %>%
## produce counts for variable of interest
count(.data[[.x]]) %>%
## spread to wide format (as in cross-tabulation)
pivot_wider(names_from = outcome, values_from = n) %>%
## drop rows with missings
filter(!is.na(.data[[.x]])) %>%
## change the variable of interest column to be called "variable"
rename("variable" = all_of(.x)) %>%
## change the variable of interest column to be a character
## otherwise non-dichotomous (categorical) variables come out as factor and can't be merged
mutate(variable = as.character(variable))
}
) %>%
## collapse the list of count outputs in to one data frame
bind_rows() %>%
## merge with the outputs of the regression
bind_cols(., models) %>%
## only keep columns interested in
select(term, 2:3, estimate, conf.low, conf.high, p.value)
Stratified analysis is still being worked on for gtsummary;
this page will be updated in due course.
gtsummary package
TODO
base
TODO
For multivariable analysis, there is not much difference
between using gtsummary or broom to present the data.
The workflow is the same for both, as below, and only the last step of pulling a
table together is different.
## run a regression with all variables of interest
mv_reg <- explanatory_vars %>%
## combine all names of the variables of interest separated by a plus
str_c(collapse = "+") %>%
## combined the names of variables of interest with outcome in formula style
str_c("outcome ~ ", .) %>%
glm(## define type of glm (logistic)
family = "binomial",
## define your dataset
data = linelist)
## choose a model using stepwise selection based on AIC
## direction can be "forward", "backward" or "both"
## note: starting from the full model with no scope argument, "forward" has no terms left to add,
## so the model is returned unchanged
final_mv_reg <- mv_reg %>%
step(direction = "forward", trace = FALSE)
gtsummary package
The gtsummary package provides the tbl_regression function, which will
take the outputs from a regression (glm in this case) and produce an easy
summary table.
You can also combine several different output tables produced by gtsummary with
the tbl_merge function.
## show results table of final regression
mv_tab <- tbl_regression(final_mv_reg, exponentiate = TRUE)
## combine with univariate results
tbl_merge(
tbls = list(univ_tab, mv_tab),
tab_spanner = c("**Univariate**", "**Multivariable**"))
| Characteristic | N | Univariate OR | Univariate 95% CI | Univariate p-value | Multivariable OR | Multivariable 95% CI | Multivariable p-value |
|---|---|---|---|---|---|---|---|
| gender | 4,167 | 0.98 | 0.86, 1.10 | 0.7 | 0.95 | 0.83, 1.08 | 0.4 |
| fever | 4,167 | 0.87 | 0.75, 1.02 | 0.094 | 0.88 | 0.75, 1.03 | 0.10 |
| chills | 4,167 | 1.11 | 0.96, 1.30 | 0.2 | 1.11 | 0.96, 1.30 | 0.2 |
| cough | 4,167 | 0.93 | 0.79, 1.11 | 0.4 | 0.93 | 0.79, 1.11 | 0.4 |
| aches | 4,167 | 0.94 | 0.77, 1.15 | 0.5 | 0.93 | 0.76, 1.14 | 0.5 |
| vomit | 4,167 | 1.05 | 0.93, 1.19 | 0.4 | 1.05 | 0.93, 1.19 | 0.4 |
| age_cat | 4,167 | | | | | | |
| 0-4 | | — | — | | — | — | |
| 5-9 | | 1.01 | 0.82, 1.23 | >0.9 | 1.00 | 0.82, 1.22 | >0.9 |
| 10-14 | | 1.09 | 0.88, 1.35 | 0.4 | 1.09 | 0.88, 1.35 | 0.4 |
| 15-19 | | 1.00 | 0.81, 1.23 | >0.9 | 1.01 | 0.81, 1.24 | >0.9 |
| 20-29 | | 1.43 | 1.17, 1.76 | <0.001 | 1.45 | 1.18, 1.78 | <0.001 |
| 30-49 | | 1.05 | 0.84, 1.31 | 0.7 | 1.07 | 0.85, 1.34 | 0.6 |
| 50-69 | | 1.14 | 0.71, 1.87 | 0.6 | 1.17 | 0.72, 1.92 | 0.5 |
| 70+ | | 0.84 | 0.16, 4.59 | 0.8 | 0.88 | 0.16, 4.77 | 0.9 |
OR = Odds Ratio, CI = Confidence Interval
base
mv_tab_base <- final_mv_reg %>%
## get a tidy dataframe of estimates
broom::tidy(exponentiate = TRUE, conf.int = TRUE)
## combine univariate and multivariable tables
left_join(univ_tab_base, mv_tab_base, by = "term") %>%
## choose columns and rename them
select(
"characteristic" = term,
"recovered" = "0",
"dead" = "1",
"univ_or" = estimate.x,
"univ_ci_low" = conf.low.x,
"univ_ci_high" = conf.high.x,
"univ_pval" = p.value.x,
"mv_or" = estimate.y,
"mv_ci_low" = conf.low.y,
"mv_ci_high" = conf.high.y,
"mv_pval" = p.value.y
)
## # A tibble: 20 x 11
## characteristic recovered dead univ_or univ_ci_low univ_ci_high univ_pval mv_or mv_ci_low mv_ci_high mv_pval
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 905 1183 1.31 1.20 1.43 1.31e- 9 1.37 1.06 1.78 0.0166
## 2 gender 914 1165 0.975 0.863 1.10 6.86e- 1 0.946 0.832 1.08 0.401
## 3 (Intercept) 323 465 1.44 1.25 1.66 4.89e- 7 1.37 1.06 1.78 0.0166
## 4 fever 1496 1883 0.874 0.747 1.02 9.43e- 2 0.876 0.747 1.03 0.0999
## 5 (Intercept) 1466 1852 1.26 1.18 1.35 2.29e-11 1.37 1.06 1.78 0.0166
## 6 chills 353 496 1.11 0.955 1.30 1.72e- 1 1.11 0.957 1.30 0.165
## 7 (Intercept) 263 360 1.37 1.17 1.61 1.09e- 4 1.37 1.06 1.78 0.0166
## 8 cough 1556 1988 0.933 0.785 1.11 4.33e- 1 0.935 0.786 1.11 0.444
## 9 (Intercept) 1625 2111 1.30 1.22 1.39 2.21e-15 1.37 1.06 1.78 0.0166
## 10 aches 194 237 0.940 0.770 1.15 5.48e- 1 0.935 0.765 1.14 0.511
## 11 (Intercept) 915 1153 1.26 1.16 1.37 1.77e- 7 1.37 1.06 1.78 0.0166
## 12 vomit 904 1195 1.05 0.928 1.19 4.44e- 1 1.05 0.930 1.19 0.423
## 13 (Intercept) 359 425 1.18 1.03 1.36 1.86e- 2 1.37 1.06 1.78 0.0166
## 14 age_cat5-9 353 420 1.01 0.823 1.23 9.61e- 1 1.00 0.821 1.22 0.979
## 15 age_cat10-14 270 349 1.09 0.883 1.35 4.17e- 1 1.09 0.883 1.35 0.415
## 16 age_cat15-19 285 336 0.996 0.806 1.23 9.69e- 1 1.01 0.814 1.24 0.956
## 17 age_cat20-29 284 482 1.43 1.17 1.76 5.08e- 4 1.45 1.18 1.78 0.000378
## 18 age_cat30-49 234 291 1.05 0.841 1.31 6.64e- 1 1.07 0.849 1.34 0.579
## 19 age_cat50-69 31 42 1.14 0.707 1.87 5.86e- 1 1.17 0.716 1.92 0.536
## 20 age_cat70+ 3 3 0.845 0.156 4.59 8.37e- 1 0.876 0.161 4.77 0.872
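On the model-selection step used earlier in this section: `step()` can only add terms that are listed in a `scope`, so forward selection is normally started from a small model. The sketch below illustrates this with simulated data; the objects `toy`, `null_model` and `full_model` are hypothetical and not part of the handbook analysis.

```r
# Simulated data where only x1 is truly associated with the outcome
set.seed(123)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$outcome <- rbinom(n, 1, plogis(-0.2 + 1.2 * toy$x1))

null_model <- glm(outcome ~ 1, family = binomial, data = toy)
full_model <- glm(outcome ~ x1 + x2 + x3, family = binomial, data = toy)

# Forward selection: start from the intercept-only model and allow
# terms up to the full model to be added, one at a time, based on AIC
forward_fit <- step(
  null_model,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
  direction = "forward",
  trace     = FALSE
)

# Forward selection started from the full model with no scope:
# there is nothing left to add, so the model comes back unchanged
unchanged_fit <- step(full_model, direction = "forward", trace = FALSE)

formula(forward_fit)
formula(unchanged_fit)
```

Whether stepwise selection is appropriate at all depends on the analysis; this sketch only shows how the `direction` and `scope` arguments interact.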
This section shows how to produce a plot with the outputs of your regression.
There are two options: you can build a plot yourself using ggplot2, or use the
easystats packages (parameters and see).
ggplot2 package
## remove the intercept term from your multivariable results
mv_tab_base %>%
filter(term != "(Intercept)") %>%
## plot with variable on the y axis and estimate (OR) on the x axis
ggplot(aes(x = estimate, y = term)) +
## show the estimate as a point
geom_point() +
## add in an error bar for the confidence intervals
geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) +
## show where OR = 1 is for reference as a dashed line
geom_vline(xintercept = 1, linetype = "dashed")
easystats packages
The alternative, if you do not want to specify all of the different elements required
for a ggplot, is to use a combination of easystats packages.
In this case the parameters package function model_parameters does the equivalent
of the broom package function tidy. The see package then accepts those outputs
and creates a default forest plot as a ggplot object.
## extract the exponentiated model parameters and plot them
final_mv_reg %>%
model_parameters(exponentiate = TRUE) %>%
plot()
Much of the information in this page is adapted from these resources and vignettes online:
This page will show you two ways to standardize an outcome, such as hospitalizations or mortality, by characteristics such as age and sex.
There are two main ways to standardize: direct and indirect standardization. Let’s say we would like to standardize mortality by age and sex for country A and country B, and compare the standardized rates between these countries.
To show how standardization is done, we will use the country_demographics (country A) and country_demographics_2 (country B) datasets, by age (in 5 year categories) and sex (female, male). We will add our own fictitious mortality data to these datasets. To make the dataset ready for use, we will perform the following steps:
Alternatively, instead of just adding mortality numbers per stratum, you may have a dataset per country (or per group within a country, province, city, or other catchment area) with one row for each death and information on age and sex for each (or a significant proportion) of these deaths. In this case, you can aggregate by age and sex to create a dataset with numbers per stratum, and then add this to the dataset with population numbers per stratum.
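The aggregation described in the previous paragraph could look roughly like the following base R sketch. The two small data frames (`deaths_linelist`, with one row per death, and `pop`, with population per stratum) are invented for illustration; the column names AgeGroup, Sex, Tot and Deaths are chosen to mirror the ones used later on this page.

```r
# One row per death, with age group and sex (made-up data)
deaths_linelist <- data.frame(
  AgeGroup = c("0-4", "0-4", "5-9", "5-9", "5-9", "0-4"),
  Sex      = c("Male", "Female", "Male", "Male", "Female", "Male")
)

# Population per stratum (made-up data)
pop <- data.frame(
  AgeGroup = c("0-4", "0-4", "5-9", "5-9"),
  Sex      = c("Male", "Female", "Male", "Female"),
  Tot      = c(1000, 950, 1100, 1050)
)

# Count deaths per age/sex stratum
deaths_by_stratum <- as.data.frame(
  table(AgeGroup = deaths_linelist$AgeGroup, Sex = deaths_linelist$Sex),
  responseName = "Deaths",
  stringsAsFactors = FALSE
)

# Add the death counts to the population table; strata without deaths get 0
poptable <- merge(pop, deaths_by_stratum, by = c("AgeGroup", "Sex"), all.x = TRUE)
poptable$Deaths[is.na(poptable$Deaths)] <- 0
poptable
```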
We also need a reference population, the standard population. There are several standard populations available; for the purpose of this exercise we will use the world_standard_population_by_sex. The World standard population is based on the populations of 46 countries and was developed in 1960. I found the website of NHS Scotland quite informative on the European Standard Population, World Standard Population and Scotland Standard Population: https://www.opendata.nhs.scot/dataset/standard-populations
CAUTION: If you have a newer version of R, the dsr package cannot be installed directly from CRAN because it has been archived. However, it is still available from the CRAN archive, and you can install and use that version.
For non-Mac users:
# Rtools is a separate build toolchain for Windows, not an R package; it must be installed to build from source
packageurl <- "https://cran.r-project.org/src/contrib/Archive/dsr/dsr_0.2.2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
# Other solution that may work
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="http://cran.us.r-project.org")
For Mac users:
require(devtools)
devtools::install_version("dsr", version="0.2.2", repos="https://mac.R-project.org")
Load the packages required for this analysis:
pacman::p_load(rio, # to import data
here, # to locate files
tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
frailtypack, # needed for dsr, for frailty models
dsr,
PHEindicatormethods)
# Country A
countryA_demo_data <- rio::import(here::here("data", "country_demographics.csv"))
countryA_demo_data$Country <- "A" # add column name with the name of the country
# Country B
countryB_demo_data <- rio::import(here::here("data", "country_demographics_2.csv"))
countryB_demo_data$Country <- "B" # add column name with the name of the country
# Join data of country A and country B in one object
all_countries <- rbind(countryA_demo_data, countryB_demo_data)
# Reference population
standard_pop_data <- rio::import(here::here("data", "world_standard_population_by_sex.csv"))
We need datasets with one row per stratum, while the current all_countries object has males and females listed on the same row.
# Make a data frame for males only and change the column name m
males_countries <- all_countries %>% dplyr::select(Country, age_cat5, m) # make dataframe for males only
males_countries <- males_countries %>% rename(Tot = m) # rename columns
males_countries$Sex <- "Male" # add column containing male sex
# Do the same for females
females_countries <- all_countries %>% dplyr::select(Country, age_cat5, f)
females_countries <- females_countries %>% rename(Tot = f)
females_countries$Sex <- "Female" # add column containing female sex
# Join the rows to finalize the population table with 1 row per stratum
poptable_countries <- rbind(males_countries, females_countries)
poptable_countries <- poptable_countries %>% rename(AgeGroup = age_cat5) # rename column name so it matches the column name of the reference population dataset
# Remove specific string from column values
poptable_countries <- poptable_countries %>% mutate(AgeGroup = gsub("\\+", "", AgeGroup))
We currently do not have the number of deaths for each of the strata in our poptable_countries dataset, so we will need to add these. For the purpose of this analysis, we will add fictional data. Alternatively, as described above, you could aggregate a dataset with one row per death by age and sex.
# Make a vector with number of deaths
mortality_n <- c(224, 257, 251, 245, 334, 245, 154, 189, 334, 342, 565, 432, 543, 432, 245, 543, 234, 354, # for males of country A
34, 37, 51, 145, 434, 120, 100, 143, 307, 354, 463, 639, 706, 232, 275, 543, 234, 274, # for males of country B
194, 254, 232, 214, 316, 224, 163, 167, 354, 354, 463, 574, 493, 295, 175, 380, 177, 392, # for females of country A
54, 24, 32, 154, 276, 254, 123, 164, 254, 354, 453, 654, 435, 354, 165, 432, 287, 395) # for females of country B
# Make dataset including deaths
poptable_countries$Deaths <- mortality_n # add column with number of deaths
# Create factor levels
poptable_countries <- poptable_countries %>% mutate(AgeGroup = factor(AgeGroup,
levels= c("0-4", "5-9", "10-14",
"15-19", "20-24", "25-29",
"30-34", "35-39", "40-44",
"45-49", "50-54", "55-59",
"60-64", "65-69", "70-74",
"75-79", "80-84", "85")),
Sex = factor(Sex, levels=c("Male", "Female")))
# Arrange by Country and AgeGroup
poptable_countries <- poptable_countries %>% arrange(Country, AgeGroup, Sex)
CAUTION: If you have few deaths per stratum, use 10- or 15-year categories instead of 5-year categories for age, or combine categories.
The values of the column AgeGroup from the standard_pop_data contain the words “years” and “plus”, while those of poptable_countries do not. We will have to remove these strings to make the values match.
# Remove specific string from column values
standard_pop <- standard_pop_data %>% mutate(AgeGroup = gsub("years", "", AgeGroup))
standard_pop <- standard_pop %>% mutate(AgeGroup = gsub("plus", "", AgeGroup))
standard_pop <- standard_pop %>% mutate(AgeGroup = gsub(" ", "", AgeGroup))
# Rename last column with total population numbers, this variable must be named pop
standard_pop <- standard_pop %>% rename(pop = WorldStandardPopulation)
# Create factor levels
standard_pop <- standard_pop %>% mutate(AgeGroup = factor(AgeGroup,
levels= c("0-4", "5-9", "10-14",
"15-19", "20-24", "25-29",
"30-34", "35-39", "40-44",
"45-49", "50-54", "55-59",
"60-64", "65-69", "70-74",
"75-79", "80-84", "85")),
Sex = factor(Sex, levels=c("Male", "Female")))
# Arrange by AgeGroup
standard_pop <- standard_pop %>% arrange(AgeGroup, Sex, pop)
# Add standard_pop to poptable_countries object (we need this in one dataset for PHEindicatormethods)
countries_alldata <- left_join(poptable_countries, standard_pop, by=c("AgeGroup", "Sex"))
Remember, we made 1) the poptable_countries object, which is a population table with the population size and number of deaths per stratum per country, and 2) the standard_pop object, containing the population size per stratum for our reference population, the World Standard Population.
The dsr package allows you to calculate and compare directly standardized rates (no indirectly standardized rates!).
# Calculate rates per country directly standardized for age and sex
mortality_rate <- dsr::dsr(data=poptable_countries, # specify object containing number of deaths per stratum
event=Deaths, # column containing number of deaths per stratum
fu=Tot, # column containing number of population per stratum
subgroup=Country, # units we would like to compare
AgeGroup, Sex, # characteristics to which we would like to standardize
refdata=standard_pop, # reference population, with numbers in column called pop
method="gamma", # method to calculate 95% CI
sig=0.95, # confidence level (0.95 gives 95% CIs)
mp=100000, # we want rates per 100,000 population
decimals=2) # number of decimals
## Joining, by = c("AgeGroup", "Sex")
# Print table
knitr::kable(mortality_rate) # show mortality rate before and after direct standardization
| Subgroup | Numerator | Denominator | Crude Rate (per 1e+05) | 95% LCL (Crude) | 95% UCL (Crude) | Std Rate (per 1e+05) | 95% LCL (Std) | 95% UCL (Std) |
|---|---|---|---|---|---|---|---|---|
| A | 11344 | 86790567 | 13.07 | 12.83 | 13.31 | 23.57 | 23.08 | 24.06 |
| B | 9955 | 52898281 | 18.82 | 18.45 | 19.19 | 19.33 | 18.46 | 20.22 |
Here, we see that while country A had a lower crude mortality rate than country B, it has a higher standardized rate after direct age and sex standardization.
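To make clear why the crude and standardized rates can differ so much, here is a hand calculation of a directly standardized rate on a tiny invented dataset. The three age strata and all numbers are made up; only the column names (Deaths, Tot, pop) follow the conventions used on this page. Confidence intervals are deliberately left out, since those are what the packages add.

```r
# Invented example: a population with a high share of young people
strata <- data.frame(
  AgeGroup = c("0-39", "40-69", "70+"),
  Deaths   = c(10, 40, 150),           # deaths per stratum in the study population
  Tot      = c(50000, 30000, 20000),   # study population per stratum
  pop      = c(60000, 30000, 10000)    # standard (reference) population per stratum
)

# Crude rate per 100,000: all deaths over the whole population
crude_rate <- sum(strata$Deaths) / sum(strata$Tot) * 1e5

# Directly standardized rate per 100,000: stratum-specific rates,
# weighted by each stratum's share of the standard population
stratum_rate <- strata$Deaths / strata$Tot
std_weight   <- strata$pop / sum(strata$pop)
std_rate     <- sum(stratum_rate * std_weight) * 1e5

crude_rate  # 200
std_rate    # 127
```

In this toy example the standard population has proportionally fewer people aged 70+ than the study population, so the standardized rate is lower than the crude rate.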
# Calculate RR
mortality_rr <- dsr::dsrr(data=poptable_countries, # specify object containing number of deaths per stratum
event=Deaths, # column containing number of deaths per stratum
fu=Tot, # column containing number of population per stratum
subgroup=Country, # units we would like to compare
AgeGroup, Sex, # characteristics to which we would like to standardize
refdata=standard_pop, # reference population, with numbers in column called pop
refgroup="B", # reference for comparison
estimate="ratio", # type of estimate
sig=0.95, # confidence level (0.95 gives 95% CIs)
mp=100000, # we want rates per 100,000 population
decimals=2) # number of decimals
## Joining, by = c("AgeGroup", "Sex")
# Print table
knitr::kable(mortality_rr)
| Comparator | Reference | Std Rate (per 1e+05) | Rate Ratio (RR) | 95% LCL (RR) | 95% UCL (RR) |
|---|---|---|---|---|---|
| A | B | 23.57 | 1.22 | 1.17 | 1.27 |
| B | B | 19.33 | 1.00 | 0.94 | 1.06 |
The standardized mortality rate in country A is 1.22 times the rate in country B (95% CI 1.17-1.27).
# Calculate RD
mortality_rd <- dsr::dsrr(data=poptable_countries, # specify object containing number of deaths per stratum
event=Deaths, # column containing number of deaths per stratum
fu=Tot, # column containing number of population per stratum
subgroup=Country, # units we would like to compare
AgeGroup, Sex, # characteristics to which we would like to standardize
refdata=standard_pop, # reference population, with numbers in column called pop
refgroup="B", # reference for comparison
estimate="difference", # type of estimate
sig=0.95, # confidence level (0.95 gives 95% CIs)
mp=100000, # we want rates per 100,000 population
decimals=2) # number of decimals
## Joining, by = c("AgeGroup", "Sex")
# Print table
knitr::kable(mortality_rd)
| Comparator | Reference | Std Rate (per 1e+05) | Rate Difference (RD) | 95% LCL (RD) | 95% UCL (RD) |
|---|---|---|---|---|---|
| A | B | 23.57 | 4.24 | 3.24 | 5.24 |
| B | B | 19.33 | 0.00 | -1.24 | 1.24 |
Country A has 4.24 additional deaths per 100,000 population (95% CI 3.24-5.24) compared to country B.
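As a quick check on the point estimates in the two tables above, the rate ratio and rate difference can be reproduced from the standardized rates alone. The two input values are copied from the tables; the confidence intervals cannot be reproduced this way and come from the package.

```r
# Standardized rates per 100,000, copied from the tables above
std_rate_A <- 23.57
std_rate_B <- 19.33

# Point estimates, with country B as the reference
rate_ratio      <- std_rate_A / std_rate_B
rate_difference <- std_rate_A - std_rate_B

round(rate_ratio, 2)       # 1.22
round(rate_difference, 2)  # 4.24
```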
Another way of calculating standardized rates is with the PHEindicatormethods package. This package allows you to calculate directly as well as indirectly standardized rates. We need the reference (standard) population as well as the country-specific mortality and population data in one object, which we have made earlier: countries_alldata.
# Calculate rates per country directly standardized for age and sex
mortality_rate_phe <- countries_alldata %>% group_by(Country) %>%
PHEindicatormethods::phe_dsr(Deaths, # observed number of events (column name)
n = Tot, # non-standard pops for category i.e. ageband
stdpop = pop, # standard populations for each stratum
stdpoptype = "field") # standalone vector or field name, for the std populations
# Print table
knitr::kable(mortality_rate_phe)
| Country | total_count | total_pop | value | lowercl | uppercl | confidence | statistic | method |
|---|---|---|---|---|---|---|---|---|
| A | 11344 | 86790567 | 23.56685849327109 | 23.08106689236966 | 24.05943770431395 | 95% | dsr per 100000 | Dobson |
| B | 9955 | 52898281 | 19.32549423719546 | 18.45515902513744 | 20.20881745377039 | 95% | dsr per 100000 | Dobson |
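The text above mentions indirect standardization, but no example is shown on this page. The following is only a sketch of the underlying arithmetic, done by hand in base R on invented numbers; it does not use any PHEindicatormethods function, and the column names `ref_deaths` and `ref_pop` are hypothetical.

```r
# Invented example: observed deaths and population in the study area,
# plus deaths and population in a reference population, by age stratum
strata <- data.frame(
  AgeGroup   = c("0-39", "40-69", "70+"),
  Deaths     = c(10, 40, 150),             # observed deaths, study population
  Tot        = c(50000, 30000, 20000),     # study population
  ref_deaths = c(120, 900, 4000),          # deaths, reference population
  ref_pop    = c(600000, 300000, 100000)   # reference population
)

# Stratum-specific rates in the reference population
ref_rate <- strata$ref_deaths / strata$ref_pop

# Deaths expected in the study population if it had the reference rates
expected <- sum(ref_rate * strata$Tot)
observed <- sum(strata$Deaths)

# Standardized mortality ratio (observed / expected)
smr <- observed / expected

# Indirectly standardized rate per 100,000: SMR times the crude reference rate
ref_crude_rate <- sum(strata$ref_deaths) / sum(strata$ref_pop) * 1e5
isr <- smr * ref_crude_rate

expected  # 900
smr
isr
```

An SMR below 1 means fewer deaths were observed than would be expected under the reference rates. Indirect standardization is often preferred when stratum-specific counts in the study population are small.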
TIP: If you would like to see another reproducible example than listed in this Handbook, please go to https://mran.microsoft.com/snapshot/2020-02-12/web/packages/dsr/vignettes/dsr.html.
PHEindicatormethods reference file: https://cran.r-project.org/web/packages/PHEindicatormethods/PHEindicatormethods.pdf
This page will cover methods to calculate and visualize moving averages, for:
To see a moving average for an epicurve, see the page on epicurves (LINK)
Load packages
pacman::p_load(
tidyverse, # for data management and viz
slider, # for calculating moving averages
tidyquant # for calculating moving averages on-the-fly in ggplot
)
Using the package slider to calculate a moving average in a dataframe, prior to any plotting.
In this approach, the moving average is calculated in the dataset prior to plotting:
- Within mutate(), a new column is created to hold the average. slide_index() from the slider package is used as shown below.
- In ggplot(), a geom_line() is added after the histogram, reflecting the moving average.
See the helpful online vignette for the slider package.
- Use .before = Inf to achieve cumulative averages from the first row
- Use slide() in simple cases
- Use slide_index() to designate a date column as an index, so that dates which do not appear in the dataframe are still included in the window
- .before, .after: TODO
- .complete: TODO
First we count the number of cases reported each day. Note that count() is appropriate if the data are in a linelist format (one row per case). If starting with aggregated counts you will need to follow a different approach (e.g. summarize(); see the page on Summarizing data).
# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>%
count(date_onset, name = "new_cases") %>% # count cases by date, new column is named "new_cases"
filter(!is.na(date_onset))
The new dataset now looks like this:
DT::datatable(ll_counts_7day, rownames = FALSE, options = list(pageLength = 6, scrollX=T) )
Next, we create a new column that is the 7-day average. We use the function slide_index() from slider specifically because there are missing days in the above dataframe, and they must be accounted for. To do this, we set our “index” (.i argument) as the column date_onset. Since date_onset is a column of class Date, the function recognizes this and defines the window in calendar days, so days that do not appear in the dataframe are still accounted for. If you were to use another slider function like slide(), this indexing would not occur.
Also note that the 7-day window, in this example, is achieved with the argument .before = 6. In this way the window is the current day and the 6 preceding days. If you want the window to be different (centered or following), use .after in conjunction.
## calculate the average number of cases in the preceding 7 days
ll_counts_7day <- ll_counts_7day %>%
mutate(
avg_7day = slider::slide_index_dbl( # create new column
new_cases, # calculate avg based on value in new_cases column
.i = date_onset, # index column is date_onset, so non-present dates are included in 7day window
.f = ~mean(.x, na.rm = TRUE), # function is mean() with missing values removed
.before = 6, # window is the day and 6-days before
.complete = TRUE)) # returns NA for the first days, where a full 7-day window is not available
Step 2 is plotting the 7-day average, in this case shown on top of the underlying daily data.
ggplot(data = ll_counts_7day, aes(x = date_onset)) +
geom_histogram(aes(y = new_cases), fill="#92a8d1", stat = "identity", position = "stack", colour = "#92a8d1")+
geom_line(aes(y = avg_7day), color="red", size = 1) +
scale_x_date(
date_breaks = "1 month",
date_labels = '%d/%m',
expand = c(0,0)) +
scale_y_continuous(expand = c(0,0), limits = c(0, NA)) +
labs(x="", y ="Number of confirmed cases")+
theme_minimal()
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 1 row(s) containing missing values (geom_path).
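The date-indexed window used with slide_index() above can be sketched in base R to make its behaviour concrete. This is an illustration on invented data, not the slider implementation: for each row it averages the values of all rows whose date falls within the current day and the 6 preceding calendar days. The helper `window_mean()` is hypothetical, and the `.complete` behaviour is not reproduced here.

```r
# Made-up daily counts; 4, 5, 7 and 8 January are absent from the data
counts <- data.frame(
  date_onset = as.Date("2020-01-01") + c(0, 1, 2, 5, 8),
  new_cases  = c(2, 4, 6, 8, 10)
)

# Mean of the values whose date lies in the window [d - before, d]
window_mean <- function(d, dates, values, before = 6) {
  in_window <- dates >= (d - before) & dates <= d
  mean(values[in_window])
}

# Apply the window to every row, using calendar dates instead of row positions
counts$avg_7day <- vapply(
  seq_len(nrow(counts)),
  function(i) window_mean(counts$date_onset[i], counts$date_onset, counts$new_cases),
  numeric(1)
)

counts
```

Note that in this sketch the absent days simply contribute no rows to the average; they are not counted as zero cases. If a day with no rows should count as zero, one option is to fill in the missing dates with a count of 0 first (for example with tidyr::complete()).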
TBD - tidyquant
per_pos_plot_county <- ggplot(data = filter(tests_per_county),
aes(x = DtSpecimenCollect_Final, y = prop_pos))+
geom_line(size = 1, alpha = 0.2)+ # plot raw values
tidyquant::geom_ma(n=7, size = 2)+ # plot moving average
theme_minimal_hgrid()+
coord_cartesian(xlim = c(as.Date("2020-03-15"), Sys.Date()), ylim = c(0, 15))+
labs(title = "COUNTY-WIDE TESTING PERCENT POSITIVE",
subtitle = "Daily and 7-day moving average",
y = "Percent Positive",
x = "Date Specimen Collected")+
theme_text_size+
theme(axis.text = element_text(face = "bold", size = 14),
panel.background = element_rect(fill = "khaki")
)
See the helpful online vignette for the slider package
If your use case requires that you “skip over” weekends and even holidays, you might like the almanac package.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook.
{#title_tag }
Endemic corridor analysis Detecting spikes in syndromic/routine surveillance
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Can be used to separate major steps of data preparation. Re-name as needed
Can be used to separate major steps of data preparation. Re-name as needed.
This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.
Sub-tabs if necessary. Re-name as needed.
This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
There exists a growing body of tools for epidemic modelling that lets us conduct fairly complex analyses with minimal effort. This section will provide an overview on how to use these tools to:
It is not intended as an overview of the methodologies and statistical methods underlying these tools, so please refer to the Resources tab for links to some papers covering this. Make sure you have an understanding of the methods before using these tools; this will ensure you can accurately interpret their results.
Below is an example of one of the outputs we’ll be producing in this section.
We will use two different methods and packages for Rt estimation, namely EpiNow2 and EpiEstim, as well as the projections package for forecasting case incidence.
pacman::p_load(
rio, # File import
here, # File locator
tidyverse, # Data management + ggplot2 graphics
epicontacts, # Analysing transmission networks
EpiNow2, # Rt estimation
EpiEstim, # Rt estimation
projections, # Incidence projections
incidence, # Handling incidence data
epitrix, # Useful epi functions
distcrete # Discrete delay distributions
)
We will use the standard, cleaned linelist for all analyses in this section.
# import the cleaned linelist
linelist <- rio::import("linelist_cleaned.xlsx")
The reproduction number R is a measure of the transmissibility of a disease and is defined as the expected number of secondary cases per infected case. In a fully susceptible population, this value represents the basic reproduction number R0. However, as the number of susceptible individuals in a population changes over the course of an outbreak or pandemic, and as various response measures are implemented, the most commonly used measure of transmissibility is the effective reproduction number Rt; this is defined as the expected number of secondary cases per infected case at a given time t.
The EpiNow2 package provides the most sophisticated framework for estimating Rt. It has two key advantages over the other commonly used package, EpiEstim:
However, it also has two key disadvantages:
Which package you choose to use will therefore depend on the data, time and computational resources available to you.
The delay distributions required to run EpiNow2 depend on the data you have. Essentially, you need to be able to describe the delay from the date of infection to the date of the event you want to use to estimate Rt. If you are using dates of onset, this would simply be the incubation period distribution. If you are using dates of reporting, you require the delay from infection to reporting. As this distribution is unlikely to be known directly, EpiNow2 lets you chain multiple delay distributions together; in this case, the delay from infection to symptom onset (e.g. the incubation period, which is likely known) and from symptom onset to reporting (which you can often estimate from the data).
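To make the idea of chaining concrete: the delay from infection to reporting is the sum of the delay from infection to onset and the delay from onset to reporting. The base R simulation below is a minimal sketch of that idea, not of how EpiNow2 handles delays internally. All distribution parameters are illustrative assumptions rather than estimates from the linelist or the literature.

```r
## minimal sketch: a chained delay is the sum of its component delays
## (all parameter values below are hypothetical, for illustration only)
set.seed(1)
n <- 10000

## infection to symptom onset (incubation period), lognormal
infection_to_onset <- rlnorm(n, meanlog = 2.0, sdlog = 0.5)

## symptom onset to reporting, gamma
onset_to_report <- rgamma(n, shape = 2, scale = 1.5)

## the chained delay from infection to reporting
infection_to_report <- infection_to_onset + onset_to_report

## compare the three delay distributions
summary(infection_to_onset)
summary(onset_to_report)
summary(infection_to_report)
```

The mean of the chained delay equals the sum of the component means, while its spread reflects the uncertainty in both steps.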
As we have the dates of onset for all our cases in the example linelist, we will only require the incubation period distribution to link our data (i.e. dates of symptom onset) to the date of infection. We can either estimate this distribution from the data or use values from the literature.
A literature estimate of the incubation period of Ebola (taken from this paper) with a mean of 9.1, standard deviation of 7.3 and maximum value of 30 would be specified as follows:
incubation_period_lit <- list(
mean = log(9.1),
mean_sd = log(0.1),
sd = log(7.3),
sd_sd = log(0.1),
max = 30
)
Note that EpiNow2 requires these delay distributions to be provided on a log
scale, hence the log call around each value (except the max parameter which,
confusingly, has to be provided on a natural scale). The mean_sd and sd_sd
define the standard deviation of the mean and standard deviation estimates. As
these are not known in this case, we choose the fairly arbitrary value of 0.1.
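Taking log() of a natural-scale mean and standard deviation only approximates the log-scale parameters of a lognormal distribution. As a minimal sketch in base R, the exact moment-matching conversion can be written as below. The helper name lognormal_params is hypothetical and not part of EpiNow2; check the EpiNow2 documentation for the parameterisation your installed version expects.

```r
## hypothetical helper (not part of EpiNow2): convert a natural-scale mean and
## standard deviation into the meanlog and sdlog parameters of a lognormal
lognormal_params <- function(mean, sd) {
  sdlog <- sqrt(log(1 + (sd / mean)^2))
  meanlog <- log(mean) - sdlog^2 / 2
  list(meanlog = meanlog, sdlog = sdlog)
}

## literature values for the incubation period quoted above
pars <- lognormal_params(mean = 9.1, sd = 7.3)
pars

## sanity check: recover the natural-scale mean and sd from the log-scale values
recovered_mean <- exp(pars$meanlog + pars$sdlog^2 / 2)
recovered_sd <- recovered_mean * sqrt(exp(pars$sdlog^2) - 1)
```

If the recovered values match the inputs, the log-scale parameters describe the same distribution as the literature estimate.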
In this analysis, we instead estimate the incubation period distribution
from the linelist itself using the function bootstrapped_dist_fit, which will
fit a lognormal distribution to the observed delays between infection and onset
in the linelist.
## estimate incubation period
incubation_period <- bootstrapped_dist_fit(
linelist$date_onset - linelist$date_infection,
dist = "lognormal",
max_value = 100,
bootstraps = 1
)
The other distribution we require is the generation time. As we have data on
infection times and transmission links, we can estimate this
distribution from the linelist by calculating the delay between infection times
of infector-infectee pairs. To do this, we use the handy get_pairwise function
from the package epicontacts, which allows us to calculate pairwise
differences of linelist properties between transmission pairs. We first create an
epicontacts object (see Transmission chains chapter for further
details):
## generate contacts
contacts <- linelist %>%
transmute(
from = infector,
to = case_id
) %>%
drop_na()
## generate epicontacts object
epic <- make_epicontacts(
linelist = linelist,
contacts = contacts,
directed = TRUE
)
We then fit the difference in infection times between transmission pairs,
calculated using get_pairwise, to a gamma distribution:
## estimate gamma generation time
generation_time <- bootstrapped_dist_fit(
get_pairwise(epic, "date_infection"),
dist = "gamma",
max_value = 20,
bootstraps = 1
)
Now we just need to calculate daily incidence from the linelist, which we can do
easily with the dplyr functions group_by() and n(). Note
that EpiNow2 requires the column names to be date and confirm.
## get incidence from onset dates
cases <- linelist %>%
group_by(date = date_onset) %>%
summarise(confirm = n())
We can then estimate Rt using the epinow function. Some notes on the inputs:
Any additional delay distributions can be supplied via the delays argument; we would simply insert them alongside the incubation_period object within the delay_opts function.
return_output ensures the output is returned within R and not just saved to a file.
verbose specifies that we want a readout of the progress.
horizon indicates how many days we want to project future incidence for.
We use the stan argument to specify how long we want to run the inference for. Increasing samples and chains will give you a more accurate estimate that better characterises uncertainty, but will take longer to run.
## run epinow
epinow_res <- epinow(
reported_cases = cases,
generation_time = generation_time,
delays = delay_opts(incubation_period),
return_output = TRUE,
verbose = TRUE,
horizon = 21,
stan = stan_opts(samples = 750, chains = 4)
)
Once the code has finished running, we can plot a summary very easily as follows:
## plot summary figure
plot(epinow_res)
We can also look at various summary statistics:
## summary table
epinow_res$summary
## measure estimate numeric_estimate
## 1: New confirmed cases by infection date 4 (2 -- 6) <data.table[1x9]>
## 2: Expected change in daily cases Unsure 0.5600000000000001
## 3: Effective reproduction no. 0.88 (0.73 -- 1.1) <data.table[1x9]>
## 4: Rate of growth -0.012 (-0.028 -- 0.0052) <data.table[1x9]>
## 5: Doubling/halving time (days) -60 (130 -- -25) <data.table[1x9]>
For further analyses and custom plotting, you can access the summarised daily
estimates via $estimates$summarised. We will convert this from the default
data.table to a tibble for ease of use with dplyr.
## extract summary and convert to tibble
estimates <- as_tibble(epinow_res$estimates$summarised)
estimates
As an example, let’s make a plot of the doubling time and Rt. We will only look at the first few months of the outbreak when Rt is well above one, to avoid plotting extremely high doubling times.
We use the formula log(2)/growth_rate to calculate the doubling time from the
estimated growth rate.
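The formula follows from exponential growth: if incidence grows as I(t) = I(0) * exp(r * t), it doubles when exp(r * T) = 2, so T = log(2) / r. A minimal base R sketch (the function name doubling_time is our own, not from a package):

```r
## doubling time (in days) implied by a daily exponential growth rate
doubling_time <- function(growth_rate) log(2) / growth_rate

## a growth rate of 0.1 per day: cases double roughly every 7 days
doubling_time(0.1)

## a negative growth rate gives a negative value, i.e. a halving time
doubling_time(-0.012)
```

Growth rates close to zero give very large doubling times, which is why the plot is restricted to the period when Rt is well above one.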
## make wide df for median plotting
df_wide <- estimates %>%
filter(
variable %in% c("growth_rate", "R"),
date < as.Date("2014-09-01")
) %>%
## convert growth rates to doubling times
mutate(
across(
c(median, lower_90:upper_90),
~ case_when(
variable == "growth_rate" ~ log(2)/.x,
TRUE ~ .x
)
),
## rename variable to reflect transformation
variable = replace(variable, variable == "growth_rate", "doubling_time")
)
## make long df for quantile plotting
df_long <- df_wide %>%
## pair up matching quantiles (e.g. lower_90 with upper_90)
pivot_longer(
lower_90:upper_90,
names_to = c(".value", "quantile"),
names_pattern = "(.+)_(.+)"
)
## make plot
ggplot() +
geom_ribbon(
data = df_long,
aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
color = NA
) +
geom_line(
data = df_wide,
aes(x = date, y = median)
) +
## use label_parsed to allow subscript label
facet_wrap(
~ variable,
ncol = 1,
scales = "free_y",
labeller = as_labeller(c(R = "R[t]", doubling_time = "Doubling~time"), label_parsed),
strip.position = 'left'
) +
## manually define quantile transparency
scale_alpha_manual(
values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
labels = function(x) paste0(x, "%")
) +
labs(
x = NULL,
y = NULL,
alpha = "Credible\ninterval"
) +
scale_x_date(
date_breaks = "1 month",
date_labels = "%b %d\n%Y"
) +
theme_minimal(base_size = 14) +
theme(
strip.background = element_blank(),
strip.placement = 'outside'
)
To run EpiEstim, we need to provide data on daily incidence and specify the serial interval (i.e. the distribution of delays between symptom onset of primary and secondary cases).
Incidence data can be provided as a vector, a dataframe or an incidence
object from the incidence package, and you can even distinguish between imports
and locally acquired infections; see the documentation at ?estimate_R for
further details. We will create an incidence object:
## get incidence from onset date
cases <- incidence(linelist$date_onset)
## 241 missing observations were removed.
The package provides several options for specifying the serial interval, the
details of which are provided in the documentation at ?estimate_R. We will
cover two of them here.
Using the option method = "parametric_si", we can manually specify the mean and
standard deviation of the serial interval in a config object created using the
function make_config. We use a mean and standard deviation of 12.0 and 5.2, respectively, defined in
this paper:
## make config
config_lit <- make_config(
mean_si = 12.0,
std_si = 5.2
)
We can then estimate Rt with the estimate_R function:
epiestim_res_lit <- estimate_R(
incid = cases,
method = "parametric_si",
config = config_lit
)
## Default config will estimate R on weekly sliding windows.
## To change this change the t_start and t_end arguments.
and plot a summary of the outputs:
plot(epiestim_res_lit)
As we have data on dates of symptom onset and transmission links, we can
also estimate the serial interval from the linelist by calculating the delay
between onset dates of infector-infectee pairs. As we did in the EpiNow2
section, we will use the get_pairwise function from the epicontacts
package, which allows us to calculate pairwise differences of linelist
properties between transmission pairs. We first create an epicontacts object
(see Transmission chains chapter for further details):
## generate contacts
contacts <- linelist %>%
transmute(
from = infector,
to = case_id
) %>%
drop_na()
## generate epicontacts object
epic <- make_epicontacts(
linelist = linelist,
contacts = contacts,
directed = TRUE
)
We then fit the difference in onset dates between transmission pairs, calculated
using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma
from the epitrix package for this fitting procedure, as we require a
discretised distribution.
## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))
We then pass this information to the config object, run EpiEstim
again and plot the results:
## make config
config_emp <- make_config(
mean_si = serial_interval$mu,
std_si = serial_interval$sd
)
## run epiestim
epiestim_res_emp <- estimate_R(
incid = cases,
method = "parametric_si",
config = config_emp
)
## Default config will estimate R on weekly sliding windows.
## To change this change the t_start and t_end arguments.
## plot outputs
plot(epiestim_res_emp)
These default options will provide a weekly sliding estimate and might produce a warning that you are estimating Rt too early in the outbreak for a precise estimate. You can change this by setting a later start date for the estimation as shown below. Unfortunately, EpiEstim only provides a very clunky way of specifying these estimation times, in that you have to provide a vector of integers referring to the start and end dates for each time window.
## define a vector of dates starting on June 1st
start_dates <- seq.Date(
as.Date("2014-06-01"),
max(cases$dates) - 7,
by = 1
) %>%
## subtract the starting date to convert to numeric
`-`(min(cases$dates)) %>%
## convert to integer
as.integer()
## add six days for a one week sliding window
end_dates <- start_dates + 6
## make config
config_partial <- make_config(
mean_si = 12.0,
std_si = 5.2,
t_start = start_dates,
t_end = end_dates
)
Now we re-run EpiEstim and can see that the estimates only start from June:
## run epiestim
epiestim_res_partial <- estimate_R(
incid = cases,
method = "parametric_si",
config = config_partial
)
## plot outputs
plot(epiestim_res_partial)
The main outputs can be accessed via $R. As an example, we will create a plot of
Rt and a measure of “transmission potential” given by the product of
Rt and the number of cases reported on that day; this represents the
expected number of cases in the next generation of infection.
## make wide dataframe for median
df_wide <- epiestim_res_lit$R %>%
rename_all(clean_labels) %>%
rename(
lower_95_r = quantile_0_025_r,
lower_90_r = quantile_0_05_r,
lower_50_r = quantile_0_25_r,
upper_50_r = quantile_0_75_r,
upper_90_r = quantile_0_95_r,
upper_95_r = quantile_0_975_r,
) %>%
mutate(
## extract the median date from t_start and t_end
dates = epiestim_res_lit$dates[round(map2_dbl(t_start, t_end, median))],
var = "R[t]"
) %>%
## merge in daily incidence data
left_join(as.data.frame(cases), "dates") %>%
## calculate risk across all r estimates
mutate(
across(
lower_95_r:upper_95_r,
~ .x*counts,
.names = "{str_replace(.col, '_r', '_risk')}"
)
) %>%
## separate r estimates and risk estimates
pivot_longer(
contains("median"),
names_to = c(".value", "variable"),
names_pattern = "(.+)_(.+)"
) %>%
## assign factor levels
mutate(variable = factor(variable, c("risk", "r")))
## make long dataframe from quantiles
df_long <- df_wide %>%
select(-variable, -median) %>%
## separate r/risk estimates and quantile levels
pivot_longer(
contains(c("lower", "upper")),
names_to = c(".value", "quantile", "variable"),
names_pattern = "(.+)_(.+)_(.+)"
) %>%
mutate(variable = factor(variable, c("risk", "r")))
## make plot
ggplot() +
geom_ribbon(
data = df_long,
aes(x = dates, ymin = lower, ymax = upper, alpha = quantile),
color = NA
) +
geom_line(
data = df_wide,
aes(x = dates, y = median),
alpha = 0.2
) +
## use label_parsed to allow subscript label
facet_wrap(
~ variable,
ncol = 1,
scales = "free_y",
labeller = as_labeller(c(r = "R[t]", risk = "Transmission~potential"), label_parsed),
strip.position = 'left'
) +
## manually define quantile transparency
scale_alpha_manual(
values = c(`50` = 0.7, `90` = 0.4, `95` = 0.2),
labels = function(x) paste0(x, "%")
) +
labs(
x = NULL,
y = NULL,
alpha = "Credible\ninterval"
) +
scale_x_date(
date_breaks = "1 month",
date_labels = "%b %d\n%Y"
) +
theme_minimal(base_size = 14) +
theme(
strip.background = element_blank(),
strip.placement = 'outside'
)
Besides estimating Rt, EpiNow2 also supports forecasting of
Rt and projections of case numbers by integration with the
EpiSoon package under the hood. All you need to do is specify the horizon
argument in your epinow function call, indicating how many days you want to
project into the future; see the EpiNow2 section under “Estimating
Rt” for details on how to get EpiNow2 up and running. In this
section, we will just plot the outputs from that analysis, stored in the
epinow_res object.
## define minimum date for plot
min_date <- as.Date("2015-03-01")
## extract summarised estimates
estimates <- as_tibble(epinow_res$estimates$summarised)
## extract raw data on case incidence
observations <- as_tibble(epinow_res$estimates$observations) %>%
filter(date > min_date)
## extract forecasted estimates of case numbers
df_wide <- estimates %>%
filter(
variable == "reported_cases",
type == "forecast",
date > min_date
)
## convert to even longer format for quantile plotting
df_long <- df_wide %>%
## pair up matching quantiles (e.g. lower_90 with upper_90)
pivot_longer(
lower_90:upper_90,
names_to = c(".value", "quantile"),
names_pattern = "(.+)_(.+)"
)
## make plot
ggplot() +
geom_histogram(
data = observations,
aes(x = date, y = confirm),
stat = 'identity',
binwidth = 1
) +
geom_ribbon(
data = df_long,
aes(x = date, ymin = lower, ymax = upper, alpha = quantile),
color = NA
) +
geom_line(
data = df_wide,
aes(x = date, y = median)
) +
geom_vline(xintercept = min(df_long$date), linetype = 2) +
## manually define quantile transparency
scale_alpha_manual(
values = c(`20` = 0.7, `50` = 0.4, `90` = 0.2),
labels = function(x) paste0(x, "%")
) +
labs(
x = NULL,
y = "Daily reported cases",
alpha = "Credible\ninterval"
) +
scale_x_date(
date_breaks = "1 month",
date_labels = "%b %d\n%Y"
) +
theme_minimal(base_size = 14)
The projections package developed by RECON makes it very easy to make short term incidence forecasts, requiring only knowledge of the effective reproduction number Rt and the serial interval. Here we will cover how to use serial interval estimates from the literature and how to estimate the serial interval from the linelist.
projections requires a discretised serial interval distribution of the class
distcrete from the package distcrete. We will use a gamma distribution
with a mean of 12.0 and standard deviation of 5.2 defined in
this paper. To
convert these values into the shape and scale parameters required for a gamma
distribution, we will use the function gamma_mucv2shapescale from the
epitrix package.
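For intuition, the conversion can be sketched in base R, assuming the standard shape/scale parameterisation of the gamma distribution (mean = shape * scale, variance = shape * scale^2). The helper name gamma_shape_scale is hypothetical and is not the epitrix implementation.

```r
## hypothetical helper: convert a mean (mu) and coefficient of variation (cv)
## into gamma shape and scale parameters
gamma_shape_scale <- function(mu, cv) {
  shape <- 1 / cv^2
  scale <- mu * cv^2
  list(shape = shape, scale = scale)
}

## literature serial interval: mean 12.0, standard deviation 5.2
pars <- gamma_shape_scale(mu = 12.0, cv = 5.2 / 12.0)
pars

## sanity check: recover the mean and standard deviation
gamma_mean <- pars$shape * pars$scale
gamma_sd <- sqrt(pars$shape) * pars$scale
```

Recovering the original mean and standard deviation confirms that the shape and scale describe the intended distribution.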
## get shape and scale parameters from the mean mu and the coefficient of
## variation (i.e. the ratio of the standard deviation to the mean)
shapescale <- epitrix::gamma_mucv2shapescale(mu = 12.0, cv = 5.2/12)
## make distcrete object
serial_interval_lit <- distcrete::distcrete(
name = "gamma",
interval = 1,
shape = shapescale$shape,
scale = shapescale$scale
)
Here is a quick check to make sure the serial interval looks correct. We
access the density of the distribution we have just defined via $d, which
is the discretised analogue of calling dgamma:
## check to make sure the serial interval looks correct
qplot(
x = 0:50, y = serial_interval_lit$d(0:50), geom = "area",
xlab = "Serial interval", ylab = "Density"
)
As we have data on dates of symptom onset and transmission links, we can
also estimate the serial interval from the linelist by calculating the delay
between onset dates of infector-infectee pairs. As we did in the EpiNow2
section, we will use the get_pairwise function from the epicontacts
package, which allows us to calculate pairwise differences of linelist
properties between transmission pairs. We first create an epicontacts object
(see Transmission chains chapter for further details):
## generate contacts
contacts <- linelist %>%
transmute(
from = infector,
to = case_id
) %>%
drop_na()
## generate epicontacts object
epic <- make_epicontacts(
linelist = linelist,
contacts = contacts,
directed = TRUE
)
We then fit the difference in onset dates between transmission pairs, calculated
using get_pairwise, to a gamma distribution. We use the handy fit_disc_gamma
from the epitrix package for this fitting procedure, as we require a
discretised distribution.
## estimate gamma serial interval
serial_interval <- fit_disc_gamma(get_pairwise(epic, "date_onset"))
## inspect estimate
serial_interval[c("mu", "sd")]
## $mu
## [1] 11.40901674601116
##
## $sd
## [1] 7.634986492457463
To project future incidence, we still need to provide historical incidence in
the form of an incidence object, as well as a sample of plausible
Rt values. We will generate these values using the Rt
estimates generated by EpiEstim in the previous section (under “Estimating
Rt”) and stored in the epiestim_res_emp object. In the code below,
we extract the mean and standard deviation estimates of Rt for the
last time window of the outbreak (using the tail function to access the last
element in a vector), and simulate 1000 values from a gamma distribution using
rgamma. You can also provide your own vector of Rt values that you
want to use for forward projections.
## create incidence object from dates of onset
inc <- incidence::incidence(linelist$date_onset)
## 241 missing observations were removed.
## extract plausible r values from most recent estimate
mean_r <- tail(epiestim_res_emp$R$`Mean(R)`, 1)
sd_r <- tail(epiestim_res_emp$R$`Std(R)`, 1)
shapescale <- gamma_mucv2shapescale(mu = mean_r, cv = sd_r/mean_r)
plausible_r <- rgamma(1000, shape = shapescale$shape, scale = shapescale$scale)
## check distribution
qplot(x = plausible_r, geom = "histogram", xlab = expression(R[t]), ylab = "Counts")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We then use the project function to make the actual forecast. We specify how
many days we want to project for via the n_days argument, and specify the
number of simulations using the n_sim argument.
## make projection
proj <- project(
x = inc,
R = plausible_r,
si = serial_interval$distribution,
n_days = 21,
n_sim = 1000
)
We can then handily plot the incidence and projections using the plot and
add_projections functions. We can easily subset the incidence object to only
show the most recent cases by using the square bracket operator.
## plot incidence and projections
plot(inc[inc$dates > as.Date("2015-03-01")]) %>%
add_projections(proj)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
You can also easily extract the raw estimates of daily case numbers by converting the output to a dataframe.
## convert to data frame for raw data
proj_df <- as.data.frame(proj)
proj_df
UNDER CONSTRUCTION
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Tidymodels
Liza Coyer TODO this? longitudinal data
UNDER CONSTRUCTION
UNDER CONSTRUCTION
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook.
{#title_tag }
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.
Sub-tabs if necessary. Re-name as needed.
This tab can be re-named. This tab should demonstrate execution of the task using a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
Spatial aspects of your data can provide a lot of insights into the situation of the outbreak to answer questions such as:
In this section, we will explore basic spatial data visualization methods using tmap and ggplot2 packages. We will also walk through some of the basic spatial data management and querying methods with the sf package.
Choropleth map
Density heatmap
Health facility catchment area
Load packages
First, load the packages required for this analysis:
pacman::p_load(rio, # to import data
here, # to locate files
tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
sf, # to manage spatial data using a Simple Feature format
tmap,# to produce simple maps, works for both interactive and static maps
janitor, # to clean column names
OpenStreetMap # to add OSM basemap in ggplot map
)
Sample case data
# import the cleaned case linelist
linelist <- rio::import(here::here("data", "linelist_cleaned.rds"))
linelist <- linelist[sample(nrow(linelist), 1000),]
# Create sf object
linelist_sf <-
linelist %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326)
Sierra Leone: Admin boundary shapefiles
Data downloaded from HDX:
https://data.humdata.org/dataset/sierra-leone-all-ad-min-level-boundaries
# ADM3 level
sle_adm3 <-
sf::read_sf(here::here("data/shp", "sle_adm3.shp")) %>% janitor::clean_names() %>%
filter(admin2name %in% c("Western Area Urban", "Western Area Rural"))
Sierra Leone: Population by ADM3
Data downloaded from HDX:
https://data.humdata.org/dataset/sierra-leone-population
# Population by ADM3
sle_adm3_pop <-
read.csv(here::here("data/population", "sle_admpop_adm3_2020.csv")) %>% janitor::clean_names()
Sierra Leone: Health facility data from OpenStreetMap
Data downloaded from HDX:
https://data.humdata.org/dataset/hotosm_sierra_leone_health_facilities
# OSM health facility shapefile
sle_hf <-
sf::read_sf(here::here("data/shp", "sle_hf.shp")) %>%
janitor::clean_names() %>%
filter(amenity %in% c("hospital", "clinic", "doctors"))
The easiest way to plot the XY coordinates (points) is to draw a map directly from the sf object which we created in the preparation section.
The tmap package offers simple mapping in both static (plot mode) and interactive (view mode) forms with just a few lines of code.
This blog provides a good comparison among different mapping options in R. https://rstudio-pubs-static.s3.amazonaws.com/324400_69a673183ba449e9af4011b1eeb456b9.html
tmap_mode("plot") # or "view"
## tmap mode set to plotting
#tm_shape(sle_adm3, bbox = st_bbox(linelist_sf)) +
tm_shape(sle_adm3, bbox = c(-13.3,8.43, -13.2,8.5)) +
tm_polygons(col = "#F7F7F7") +
tm_borders(col = "#000000", lwd = 2) +
tm_text("admin3name") +
tm_shape(linelist_sf) + tm_dots(size=0.08, col='blue')
## Warning: One tm layer group has duplicated layer types, which are omitted. To draw multiple layers of the same type, use multiple layer groups (i.e. specify
## tm_shape prior to each of them).
Choropleth maps are useful for visualizing your data by pre-defined areas, usually administrative units or health areas. In outbreak response, this can help to target resources at specific areas with high incidence rates, for example.
The current linelist data does not contain any information about the administrative units. Although it is ideal to store such information during the initial data collection phase, we can also assign administrative units to individual cases based on their spatial relationships (i.e. point intersects with a polygon).
The sf package offers various methods for spatial joins. See more documentation about the st_join method and spatial join types here: https://r-spatial.github.io/sf/reference/geos_binary_pred.html
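As a minimal, self-contained sketch of a point-in-polygon join with sf, the toy example below uses one made-up square polygon and two made-up case locations (none of these come from the handbook data) to show how st_join attaches polygon attributes to the points that fall inside it.

```r
library(sf)

## toy polygon standing in for one administrative unit (hypothetical)
square <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
areas <- st_sf(area_name = "Area A", geometry = st_sfc(square, crs = 4326))

## two toy case locations: the first inside the square, the second outside
cases_df <- data.frame(
  case_id = c("c1", "c2"),
  lon = c(0.5, 2),
  lat = c(0.5, 2)
)
cases_pts <- st_as_sf(cases_df, coords = c("lon", "lat"), crs = 4326)

## left join by default: cases outside every polygon get NA for area_name
joined <- st_join(cases_pts, areas, join = st_intersects)
joined$area_name
```

Because the join keeps all points by default, cases that fall outside every polygon remain in the result with missing attributes, which is worth checking before counting cases by area.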
Spatially assign administrative units to cases
First, spatially intersect our case locations (points) with the ADM3 boundaries (polygons).
linelist_adm <-
linelist_sf %>%
sf::st_join(sle_adm3, join = st_intersects) %>%
select(names(linelist_sf), admin3name, admin3pcod)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
# Now you will see the ADM3 names attached to each case
linelist_adm %>% select(case_id, admin3name)
## Simple feature collection with 1000 features and 2 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -13.2727552092938 ymin: 8.44838662875464 xmax: -13.2052242314884 ymax: 8.49174756781319
## geographic CRS: WGS 84
## First 10 features:
## case_id admin3name geometry
## 5443 631102 Mountain Rural POINT (-13.2172679238128 8....
## 4493 a8a470 West II POINT (-13.2489387103962 8....
## 857 3e6e2d Mountain Rural POINT (-13.2125371941858 8....
## 4046 391c7c East II POINT (-13.2145130684954 8....
## 2930 df9d8c West II POINT (-13.2467367509832 8....
## 3104 1ebd20 East II POINT (-13.2118140159416 8....
## 141 1cc5e0 East II POINT (-13.2186890009113 8....
## 1059 9db018 West I POINT (-13.2482279575881 8....
## 142 8d81ff West III POINT (-13.2636371716871 8....
## 3365 46037e Mountain Rural POINT (-13.2180562995949 8....
Case counts by ADM3
case_adm3 <-
linelist_adm %>% as_tibble() %>%
#filter(!is.na(admin3pcod)) %>%
group_by(admin3pcod, admin3name) %>%
summarise(cases = n()) %>%
arrange(desc(cases))
## `summarise()` has grouped output by 'admin3pcod'. You can override using the `.groups` argument.
case_adm3
## # A tibble: 10 x 3
## # Groups: admin3pcod [10]
## admin3pcod admin3name cases
## <chr> <chr> <int>
## 1 SL040102 Mountain Rural 276
## 2 SL040208 West III 242
## 3 SL040207 West II 170
## 4 SL040204 East II 112
## 5 SL040201 Central I 55
## 6 SL040203 East I 47
## 7 SL040206 West I 40
## 8 SL040202 Central II 32
## 9 SL040205 East III 21
## 10 <NA> <NA> 5
Choropleth mapping
Now that we have the administrative unit names assigned to all cases, we can start mapping the case counts by area (choropleth maps).
Since we also have population data by ADM3, we can add this information to the case_adm3 table created previously.
# Add population data and calculate cases per 10K population
case_adm3 <-
case_adm3 %>%
left_join(sle_adm3_pop, by=c("admin3pcod"="adm3_pcode")) %>%
select(names(case_adm3), total) %>%
mutate(case_10kpop = round(cases/total * 10000, 3))
case_adm3
## # A tibble: 10 x 5
## # Groups: admin3pcod [10]
## admin3pcod admin3name cases total case_10kpop
## <chr> <chr> <int> <int> <dbl>
## 1 SL040102 Mountain Rural 276 33993 81.2
## 2 SL040208 West III 242 210252 11.5
## 3 SL040207 West II 170 145109 11.7
## 4 SL040204 East II 112 99821 11.2
## 5 SL040201 Central I 55 69683 7.89
## 6 SL040203 East I 47 68284 6.88
## 7 SL040206 West I 40 60186 6.65
## 8 SL040202 Central II 32 23874 13.4
## 9 SL040205 East III 21 500134 0.42
## 10 <NA> <NA> 5 NA NA
Join this table with the ADM3 polygons for mapping
# Join the case counts and rates with the ADM3 polygons
case_adm3_sf <-
case_adm3 %>%
left_join(sle_adm3, by="admin3pcod") %>%
select(objectid, admin3pcod, admin3name=admin3name.x, admin2name, admin1name,
cases, total, case_10kpop, geometry) %>%
st_as_sf()
Mapping the results
# Number of cases
tmap_mode("plot")
## tmap mode set to plotting
tm_shape(case_adm3_sf) +
tm_polygons("cases") +
tm_text("admin3name")
## Warning: The shape case_adm3_sf contains empty units.
# Cases per 10K population
tmap_mode("plot")
## tmap mode set to plotting
tm_shape(case_adm3_sf) +
tm_polygons("case_10kpop",
breaks=c(0, 10, 50, 100),
palette = "Purples"
) +
tm_text("admin3name")
## Warning: The shape case_adm3_sf contains empty units.
We can also look at the combination of time and space by facetting the heatmaps.
Set parameters for the basemap using the OpenStreetMap package.
# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)), # limits of tile
c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
zoom = NULL,
type = c("osm", "stamen-toner", "stamen-terrain","stamen-watercolor", "esri","esri-topo")[1],
mergeTiles = TRUE)
# Projection WGS84
map.latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
Heatmap by month of onset
# Extract month of onset
linelist$date_onset_ym <- format(linelist$date_onset, "%Y-%m")
# Simply facet above map by month of onset
# Plot map. Must be autoplotted to work with ggplot
OpenStreetMap::autoplot.OpenStreetMap(map.latlon)+
# Density tiles
ggplot2::stat_density_2d(aes(x = lon,
y = lat,
fill = ..level..,
alpha=..level..),
bins = 10,
geom = "polygon",
contour_var = "count",
data = linelist %>% filter(date_onset>='2014-08-01' & date_onset<='2015-01-31'),
show.legend = F) +
#scale_fill_gradient(low = "black", high = "red")+
labs(x = "Longitude",
y = "Latitude",
title = "Distribution of simulated cases by month of onset") +
facet_wrap(~ date_onset_ym, ncol = 3)
It might be useful to know where the health facilities are located in relation to the disease hot spots.
Finding the nearest health facility
We can use the st_nearest_feature method from the sf package to assign the closest health facility to individual cases.
# Closest health facility to each case
linelist_sf_hf <-
linelist_sf %>%
st_join(sle_hf, join = st_nearest_feature) %>%
select(case_id, osm_id, name, amenity)
## although coordinates are longitude/latitude, st_nearest_points assumes that they are planar
We can see that “Den Clinic” is the closest health facility for about 38% of the cases (382 of 1000).
# Group cases by health facility
hf_catchment <-
linelist_sf_hf %>% as.data.frame() %>%
group_by(name) %>%
summarise(case_n = n()) %>%
arrange(desc(case_n))
hf_catchment
## # A tibble: 8 x 2
## name case_n
## <chr> <int>
## 1 Den Clinic 382
## 2 Shriners Hospitals for Children 314
## 3 GINER HALL COMMUNITY HOSPITAL 178
## 4 panasonic 47
## 5 Princess Christian Maternity Hospital 28
## 6 ARAB EGYPT CLINIC 20
## 7 <NA> 18
## 8 MABELL HEALTH CENTER 13
Visualizing the results on the map
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(linelist_sf_hf) + tm_dots(size=0.08, col='name') +
tm_shape(sle_hf) + tm_dots(size=0.3, col='red') + tm_text("name") +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))
Cases within 30 minutes' walking distance of the closest health facility
We can also explore how many cases are located within 2.5km (~30 mins) walking distance from the closest health facility.
Note: For more accurate distance calculations, it is better to re-project your sf object to a local projected coordinate system such as UTM (Earth projected onto a planar surface). In this example, for simplicity, we will stick to the World Geodetic System (WGS84) geographic coordinate system (Earth represented as a spherical / round surface, therefore the units are in decimal degrees). We will use a general conversion of: 1 decimal degree = ~111km.
See more information about map projections and coordinate systems: https://www.esri.com/arcgis-blog/products/arcgis-pro/mapping/gcs_vs_pcs/
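As a rough illustration of the conversion above, the sketch below converts between kilometres and decimal degrees using the ~111 km per degree approximation. The helper names km_to_degrees() and degrees_to_km() are hypothetical (they are not part of sf), and the constant is only an approximation: a degree of longitude covers less ground the further you are from the equator.

```r
# Rough conversion factor taken from the text above (an approximation, not an exact value)
km_per_degree <- 111

# Hypothetical helpers for illustration only
km_to_degrees <- function(km) km / km_per_degree
degrees_to_km <- function(deg) deg * km_per_degree

# Buffer radius, in decimal degrees, that corresponds to roughly 2.5 km
radius_deg <- km_to_degrees(2.5)
radius_deg

# Distance, in km, represented by a 0.02-degree buffer
radius_km <- degrees_to_km(0.02)
radius_km
```

Under this approximation, 2.5 km is about 0.0225 decimal degrees, and a 0.02-degree buffer is about 2.2 km.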
First, create a circular buffer with a radius of ~2.5km around each health facility.
sle_hf_2k <-
sle_hf %>%
st_buffer(dist=0.02) # approximately 2.5km
## Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle = endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
## dist is assumed to be in decimal degrees (arc_degrees).
Intersect this with the cases
# Intersect the cases with the buffers
linelist_sf_hf_2k <-
linelist_sf_hf %>%
st_join(sle_hf_2k, join = st_intersects, left = TRUE) %>%
filter(osm_id.x==osm_id.y | is.na(osm_id.y)) %>%
select(case_id, osm_id.x, name.x, amenity.x, osm_id.y)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
Count the results
214 out of 1000 cases (21.4%, shown as red dots in the map below) live more than 30 mins away from the nearest health facility.
nrow(linelist_sf_hf_2k)
## [1] 1000
nrow(linelist_sf_hf_2k[is.na(linelist_sf_hf_2k$osm_id.y),])
## [1] 214
Visualize the results
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(linelist_sf_hf) + tm_dots(size=0.08, col='name') +
tm_shape(sle_hf_2k) + tm_borders(col = "red", lwd = 2) +
tm_shape(linelist_sf_hf_2k[is.na(linelist_sf_hf_2k$osm_id.y),]) +tm_dots(size=0.1, col='red') +
tm_view(set.view = c(-13.2284,8.4699, 13), set.zoom.limits = c(13,14))
R Simple Features and sf package https://cran.r-project.org/web/packages/sf/vignettes/sf1.html
R tmap package https://cran.r-project.org/web/packages/tmap/vignettes/tmap-getstarted.html
ggmap: Spatial Visualization with ggplot2 https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
UNDER CONSTRUCTION
https://www.tidyverse.org/blog/2018/07/ggplot2-3-0-0/
Embed ggplot cheatsheet
highlighting one line among many etc gghighlight
http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/#modifying-facet-label-text
labellers
https://ggplot2.tidyverse.org/reference/labellers.html
facet_wrap vs. facet_grid
Using option label_wrap_gen in facet_wrap to have multiple strip lines
Cowplot Complicated method (% 100 * …)
ggrepel
This code chunk shows the loading of packages required for the analyses.
pacman::p_load(rio, # File import
here, # File locator
lubridate, # working with dates
aweek, # alternative package for working with dates
incidence, # an option for epicurves of linelist data
stringr, # Search and manipulate character strings
forcats, # working with factors
RColorBrewer, # Color palettes from colorbrewer2.org
tidyverse) # data management + ggplot2 graphics
Two example datasets are used in this section:
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.
For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
# fake import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")
Review the two datasets and notice the differences
Case linelist
The first 50 rows are displayed
Case counts aggregated by hospital
The first 50 rows are displayed
You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”).
You can then reference the data_date in the code when applying filters or in captions that auto-update.
## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")
Verify that each relevant date column is class Date and has an appropriate range of values. This for loop prints a histogram for each column.
# create character vector of column names
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
# Produce histogram of each date column
for (Col in DateCols) { # open loop. iterate for each name in vector DateCols
hist(linelist[[Col]], # print histogram of the column ([[ ]] returns a vector for data frames and tibbles alike)
breaks = 50, # number of breaks for the histogram
xlab = Col) # x-axis label is the name of the column
} # close the loop
incidence package
Below are tabs on making quick epicurves using the incidence package
CAUTION: The incidence package expects data to be in a “linelist” format of one row per case (not aggregated). If your data are aggregated counts, look to the ggplot epicurves tab.
TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.
Two steps are required to plot an epicurve with the incidence package:
Create an incidence object (using the function incidence())
Plot the incidence object with plot()
A simple example - an epicurve of daily cases:
# load incidence package
library(incidence)
# create the incidence object using data by day
epi_day <- incidence(linelist$date_onset, # the linelist data
interval = "day") # the time interval
## 241 missing observations were removed.
# plot the incidence object
plot(epi_day)
Change time interval of case aggregation (bars)
The interval argument defines how the observations are grouped. Available options include all the options from the package aweek, including but not limited to:
Below are examples of how different intervals look when applied to the linelist.
Format and frequency of the date labels on the x-axis are the defaults for the specified interval.
# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk <- incidence(linelist$date_onset, interval = "Monday week")
## 241 missing observations were removed.
# Sunday week
epi_Sun_wk <- incidence(linelist$date_onset, interval = "Sunday week")
## 241 missing observations were removed.
# Three weeks (Monday weeks by default)
epi_3wk <- incidence(linelist$date_onset, interval = "3 weeks")
## 241 missing observations were removed.
# Monthly
epi_month <- incidence(linelist$date_onset, interval = "month")
## 241 missing observations were removed.
# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+ labs(title = "Monday weeks")
plot(epi_Sun_wk)+ labs(title = "Sunday weeks")
plot(epi_3wk)+ labs(title = "Every 3 Monday weeks")
plot(epi_month)+ labs(title = "Months")
The incidence package enables modifications in the following ways:
plot() arguments (e.g. show_cases, col_pal, alpha…)
scale_x_incidence() and make_breaks()
ggplot() additions via the + operator
Read details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.
plot() modifications
An incidence plot can be modified in the following ways. Type ?plot.incidence in the R console for more details.
show_cases = If TRUE, each case is shown as a box. Best on smaller outbreaks.
color = Color of case bars/boxes
border = Color of line around boxes, if show_cases = TRUE
alpha = Transparency of case bars/boxes (1 is fully opaque, 0 is fully transparent)
xlab = Title of x-axis (axis labels can also be applied using labs() from ggplot)
ylab = Title of y-axis; defaults to user-defined incidence time interval
labels_week = Logical, indicate whether x-axis labels are in week or date format, absent other modifications
n_breaks = Number of x-axis label breaks, absent other modifications
first_date, last_date = Dates used to trim the plot
See examples of these arguments in the subsequent tabs.
To plot the epicurve of a subset of data:
Filter the linelist data
Provide the filtered data to the incidence() command
Plot the incidence object
The example below uses data filtered to show only cases at Central Hospital.
# filter the dataset
central_data <- linelist %>%
filter(hospital == "Central Hospital")
# create incidence object using subset of data
central_outbreak <- incidence(central_data$date_onset, interval = "week")
## 26 missing observations were removed.
# plot
plot(central_outbreak) + labs(title = "Weekly case incidence at Central Hospital")
TIP: Remember that date-axis labels are independent from the aggregation of the data into bars
Modify the bars
The aggregation of data into bars occurs when you set the interval = when creating the incidence object. The options for interval come from the package aweek and include options like “day”, “Monday week”, “Sunday week”, “month”, “2 weeks”, etc. See the incidence intro tab for more information.
Modify date-axis labels (frequency & format)
If working with the incidence package, you have several options to make these modifications. Some utilize the incidence package functions scale_x_incidence() and make_breaks(), others use the ggplot2 function scale_x_date(), and others use a combination.
DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut off your data if the data change!
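One way to avoid static numbers is to derive the breaks from the data. The sketch below is a minimal base R example. It assumes a hypothetical vector of weekly counts (weekly_counts) and a break step of 5; in practice the counts would come from your own incidence object or aggregated data.

```r
# Hypothetical weekly case counts, for illustration only
weekly_counts <- c(2, 7, 15, 33, 41, 28, 12)

# Distance between y-axis breaks
step <- 5

# Round the largest count up to the next multiple of `step`,
# so the top break always sits at or above the tallest bar
y_max <- ceiling(max(weekly_counts) / step) * step

# Breaks that adapt when the data change
y_breaks <- seq(0, y_max, by = step)
y_breaks
```

The resulting vector could then be supplied to scale_y_continuous(breaks = y_breaks) in place of a hard-coded sequence.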
scale_x_incidence() only
Use scale_x_incidence() from the incidence package:
Date labels follow the interval of the incidence object (e.g. Sundays or Mondays)
n_breaks = specifies the number of date labels, which start from the interval of the first case.
For a label every nth interval, use n_breaks = nrow(i)/n (“i” is the incidence object name and “n” is a number)
labels_week = labels formatted as either weeks (YYYY-Www) or dates (YYYY-MM-DD)
Other notes:
Type ?scale_x_incidence into the R console to see more information.
Adding scale_x_date() to the plot will remove labels created by scale_x_incidence()
# create weekly incidence object (Sunday weeks)
i <- incidence(central_data$date_onset, interval = "Sunday week")
## 26 missing observations were removed.
plot(i)+
scale_x_incidence(i, # name of incidence object
labels_week = F, # show dates instead of weeks
n_breaks = nrow(i)/8) # breaks every 8 weeks from week of first case
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
scale_x_date() and make_breaks()
Use scale_x_date() from ggplot2, but also leverage make_breaks() from incidence:
Use make_breaks() to define date label breaks
make_breaks() is similar to scale_x_incidence() (described above). Provide the incidence object name and optionally n_breaks as described before.
Add scale_x_date() to the plot:
breaks = provide the breaks vector you created with make_breaks(), followed by $breaks (see example below)
date_labels = provide a format for the date labels (e.g. “%d %b”) (use “\n” for a new line)
# Break modification using scale_x_date() and make_breaks()
###########################################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
## 26 missing observations were removed.
# make breaks
i_labels <- make_breaks(i, n_breaks = nrow(i)/6) # using interval from i, breaks every 6 weeks
# plot
plot(i)+
scale_x_date(breaks = i_labels$breaks, # call the breaks
date_labels = "%d\n%b '%y", # date format
date_minor_breaks = "weeks") # gridlines each week (aligns with Mondays)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
scale_x_date() only
Use scale_x_date() only:
date_breaks = (e.g. “day”, “week”, “2 weeks”, “month”, “year”)
date_minor_breaks = for vertical lines between date labels
Alternatively, provide a sequence of dates to breaks = and to minor_breaks =
date_labels = for formatting (see Dates page for tips)
expand = c(0,0) to start labels at the first incidence bar. Otherwise, the first label will shift depending on your specified label interval.
Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.
# Break modification using scale_x_date() only
##############################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
## 26 missing observations were removed.
# plot
plot(i)+
scale_x_date(expand = c(0,0), # remove excess x-axis space before and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y") # date labels format
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
If you want a plot of Sunday weeks and also finely-adjusted label formats, you might find a code example helpful.
Here is an example of producing a weekly epicurve using incidence for Sunday weeks, with finely-adjusted date labels through scale_x_date():
# load packages
pacman::p_load(tidyverse, # for ggplot
incidence, # for epicurve
lubridate) # for floor_date() and ceiling_date()
# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset, interval = "Sunday week") # equivalent to "MMWRweek" (see US CDC)
## 26 missing observations were removed.
# plot() the incidence object
plot(central_outbreak)+
### ggplot() commands added to the plot
# scale modifications
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
# sequence by 3 weeks, from Sunday before first case to Sunday after last case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
# sequence by week, from Sunday before first case to Sunday after last case
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
# date labels
date_labels = "%d\n%b'%y")+ # adjust how dates are displayed
scale_y_continuous(
expand = c(0,0), # remove excess space under x-axis
breaks = seq(0, 30, 5))+ # adjust y-axis intervals
# Aesthetic themes
theme_minimal()+ # simplify background
theme(
axis.title = element_text(size = 12, face = "bold"), # axis titles formatting
plot.caption = element_text(face = "italic", hjust = 0))+ # caption formatting, left-aligned
# Plot labels
labs(x = "Week of symptom onset (Sunday weeks)",
y = "Weekly case incidence",
title = "Weekly case incidence at Central Hospital",
#subtitle = "",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
To show boxes around each individual case, use the argument show_cases = TRUE in the plot() function.
Boxes around each case can be more reader-friendly, if the outbreak is of a small size. Boxes can be applied when the interval is days, weeks, or any other time period. The code below creates the weekly epicurve for a smaller outbreak (only cases from Central Hospital), with boxes around each case.
# create filtered dataset for Central Hospital
central_data <- linelist %>%
filter(hospital == "Central Hospital")
# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")
## 26 missing observations were removed.
# plot outbreak
plot(central_outbreak,
show_cases = T) # show boxes around individual cases
The same epicurve showing individual cases, but with other aesthetic modifications:
# add plot() arguments and ggplot() commands
plot(central_outbreak,
show_cases = T, # show boxes around each individual case
color = "lightblue", # color inside boxes
border = "darkblue", # color of border around boxes
alpha = 0.5)+ # transparency
### ggplot() commands added to the plot
# scale modifications
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "4 weeks", # labels appear every 4 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b'%y")+ # date labels format
scale_y_continuous(
expand = c(0,0), # remove excess space under x-axis
breaks = seq(0, 35, 5))+ # adjust y-axis intervals
# aesthetic themes
theme_minimal()+ # simplify background
theme(
axis.title = element_text(size = 12, face = "bold"), # axis title format
plot.caption = element_text(face = "italic", hjust = 0))+ # caption format and left-align
# plot labels
labs(x = "Week of symptom onset (Monday weeks)",
y = "Weekly reported cases",
title = "Weekly case incidence at Central Hospital",
#subtitle = "",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
To color the cases by value, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their age category. Note the use of incidence() argument na_as_group =. If TRUE (by default) missing values (NA) will form their own group.
# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset, # date of onset for x-axis
interval = "week", # weekly aggregation of cases
groups = linelist$age_cat, # color by age_cat value
na_as_group = TRUE) # missing values assigned their own group
## 241 missing observations were removed.
# plot the epicurve
plot(age_outbreak)
Adjusting order
To adjust the order of group appearance (on plot and in legend), the group column must be class Factor. Adjust the order by adjusting the order of the levels (including NA). Below is an example with gender groups using data from Central Hospital only.
List NA first in the levels, so it appears on the top of the bars
exclude = NULL in factor() is necessary to adjust the order of NA, which is excluded by default
The legend title is set with fill = in labs()
You can read more about factors in their page (LINK)
# Create incidence object, data grouped by gender
#################################################
# Classify "gender" column as factor
####################################
# with specific level order and labels, including for missing values
central_data <- linelist %>%
filter(hospital == "Central Hospital") %>%
mutate(gender = factor(gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
# Create incidence object, by gender
####################################
gender_outbreak_central <- incidence(central_data$date_onset,
interval = "week",
groups = central_data$gender,
na_as_group = TRUE) # Missing values assigned their own group
## 26 missing observations were removed.
# plot epicurve with modifications
##################################
plot(gender_outbreak_central,
show_cases = TRUE)+ # show box around each case
### ggplot commands added to plot
# scale modifications
scale_x_date(expand = c(0,0),
date_breaks = "6 weeks",
date_minor_breaks = "week",
date_labels = "%d %b\n%Y")+
# aesthetic themes
theme_minimal()+ # simplify plot background
theme(
legend.title = element_text(size = 14, face = "bold"),
axis.title = element_text(face = "bold"))+ # axis title bold
# plot labels
labs(fill = "Gender", # title of legend
title = "Show case boxes, with modifications",
y = "Weekly case incidence",
x = "Week of symptom onset")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
To change the legend
Use ggplot() commands such as:
theme(legend.position = "top") (or “bottom”, “left”, “right”)
theme(legend.direction = "horizontal")
theme(legend.title = element_blank()) to have no title
See the page of ggplot() tips for more details on legends.
To specify colors manually, provide the name of the color or a character vector of multiple colors to the argument color =. Note that to function properly, the number of colors listed must equal the number of groups (be aware of missing values as a group).
# weekly outbreak by hospital
hosp_outbreak <- incidence(linelist$date_onset,
interval = "week",
groups = linelist$hospital,
na_as_group = FALSE) # Missing values not assigned their own group
## 241 missing observations were removed.
# default colors
plot(hosp_outbreak)
# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))
To change the color palette
Use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).
Other palettes include TO DO add page with palette names… To DO
# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset, # date of onset for x-axis
interval = "week", # weekly aggregation of cases
groups = linelist$age_cat, # color by age_cat value
na_as_group = TRUE) # missing values assigned their own group
## 241 missing observations were removed.
# plot the epicurve
plot(age_outbreak)
# plot with different color palette
plot(age_outbreak, col_pal = rainbow)
To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()
ggplot()
Below are tabs on using the ggplot2 package to produce epicurves from a linelist dataset.
Unlike with the incidence package, you must manually control the aggregation of the data (into weeks, months, etc.) and the labels on the date axis. If not carefully managed, this can lead to many headaches.
These tabs use a subset of the linelist dataset - only the cases from Central Hospital.
central_data <- linelist %>%
filter(hospital == "Central Hospital")
detach("package:tidyverse", unload=TRUE)
library(tidyverse)
To produce an epicurve with ggplot() there are three main elements:
Below is perhaps the simplest code to produce daily and weekly epicurves. Axis scales and labels use default options.
# daily
ggplot(data = central_data, aes(x = date_onset)) + # x column must be class Date
geom_histogram(binwidth = 1)+ # date values binned by 1 day
labs(title = "Daily")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# weekly
ggplot(data = central_data, aes(x = date_onset)) +
geom_histogram(binwidth = 7)+ # date values binned each 7 days (arbitrary 7 days!)
labs(title = "Weekly")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
CAUTION: Using binwidth = 7 starts the first bin at the first case, which could be any day of the week! To create specific Monday or Sunday weeks, see below.
To create weekly epicurves where the bins begin on a specific day of the week (e.g. Monday, Sunday), specify the histogram breaks = manually (not binwidth). This can be done by creating a sequence of dates using seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD")), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such weekly breaks is below:
seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days")
To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.
This code is applied to create the histogram breaks, and also the breaks for the date labels. Read more about the date labels in the Modifications tab. Defining your breaks like above will be necessary if your weekly bins are not by Monday weeks.
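To make the floor/ceiling logic concrete, here is a minimal, self-contained sketch on a small hypothetical vector of onset dates (onset_dates is invented for illustration and stands in for central_data$date_onset). It assumes the lubridate package is installed.

```r
library(lubridate)

# Hypothetical onset dates, including a missing value
onset_dates <- as.Date(c("2014-05-08", "2014-05-20", NA, "2014-06-03"))

# Monday on or before the earliest onset date (week_start = 1 means Monday)
first_monday <- floor_date(min(onset_dates, na.rm = TRUE), "week", week_start = 1)

# Monday after the latest onset date
last_monday <- ceiling_date(max(onset_dates, na.rm = TRUE), "week", week_start = 1)

# Sequence of weekly break points covering all cases
weekly_breaks <- seq.Date(from = as.Date(first_monday),
                          to   = as.Date(last_monday),
                          by   = "7 days")

weekly_breaks
weekdays(weekly_breaks)   # weekday names, printed in your locale
```

Changing week_start to 7 in both calls would shift every break to a Sunday.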
Below is detailed code to produce weekly epicurves for Monday and Sunday weeks. See the tab on Modifications (axes) to learn the nuances of date-axis label management.
Monday weeks
Of note:
The histogram bin break points are set manually to begin the Monday (week_start = 1) before the earliest case and to end the Monday after the last case (see explanation above).
Date labels are set with date_breaks = within scale_x_date(), which also uses Monday weeks. Sunday weeks use a different method.
Minor vertical gridlines are set with date_minor_breaks = within scale_x_date(), again because this plot is for Monday weeks. Sunday weeks use a different method.
Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
Color and fill of the bars are set within geom_histogram()
# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) +
# make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
geom_histogram(
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"), # bins are 7-days
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of fill within bars
# x-axis labels
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # remove excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases (Monday weeks)",
subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Mondays",
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Sunday weeks
The below code creates a histogram of the rows, using a date column as the x-axis. Of note:
The histogram bin break points are set manually to begin the Sunday (week_start = 7) before the earliest case and to end the Sunday after the last case (see explanation above).
Date labels and minor gridlines are set by providing date sequences to breaks = and minor_breaks = within scale_x_date(). You cannot use the scale_x_date() arguments date_breaks and date_minor_breaks, as these align with Monday weeks.
Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
Color and fill of the bars are set within geom_histogram()
# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) +
# For histogram, manually specify bin break points: starts the Sunday before first case, end Sunday after last case
geom_histogram(
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"), # bins are 7-days
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of fill within bars
# The labels on the x-axis
scale_x_date(expand = c(0,0),
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
date_labels = "%d\n%b\n'%y")+ # day, above month abbrev., above 2-digit year
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases (Sunday weeks)",
subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Sundays",
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
TIP: Remember that date-axis labels are independent from the aggregation of the data into bars
To modify the aggregation of data into bins/bars, do one of the following:
Specify binwidth = within geom_histogram() - for a column of class Date, the given number is interpreted in days
Specify breaks = as a sequence of bin break-point dates
Aggregate the data into counts before passing them to ggplot(). See the tab on aggregated counts for more information.
To modify the date labels, use scale_x_date() in one of these ways:
date_breaks = to specify label frequency (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)
date_minor_breaks = to specify frequency of minor vertical gridlines between date labels
expand = c(0,0) to begin the labels at the first bar (otherwise, first label will shift forward depending on specified frequency)
date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)
Alternatively, specify breaks = and minor_breaks = by providing a sequence of dates for breaks
date_labels = for formatting as described above
To create a sequence of dates
You can use seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD")), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such flexible breaks is below:
seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days")
To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.
If using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.
Set maximum and minimum date values using limits = c() within scale_x_date(). E.g. scale_x_date(limits = c(as.Date("2014-04-01"), NA)) sets a minimum but leaves the maximum open.
CAUTION: Be careful using limits! They remove all data outside the limits, which can impact y-axis max/min, modeling, and other statistics. Strongly consider setting the limits with coord_cartesian() instead, which acts as a “zoom” without removing data.
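The difference can be checked on a small example. The sketch below uses a hypothetical data frame of eight onset dates (three in March, five in April and May 2014) and compares how many cases reach the histogram statistic under each approach. It assumes ggplot2 is installed; layer_data() is used only to inspect the computed bins.

```r
library(ggplot2)

# Hypothetical onset dates: 3 cases in March, 5 cases in April-May 2014
onset <- data.frame(
  date_onset = as.Date(c("2014-03-03", "2014-03-10", "2014-03-17",
                         "2014-04-07", "2014-04-14", "2014-04-21",
                         "2014-05-05", "2014-05-12")))

base_plot <- ggplot(onset, aes(x = date_onset)) +
  geom_histogram(binwidth = 7)

# Option 1: limits in scale_x_date() discard the March cases before binning
plot_limits <- base_plot +
  scale_x_date(limits = c(as.Date("2014-04-01"), NA))

# Option 2: coord_cartesian() only zooms the view; all cases are still binned
plot_zoom <- base_plot +
  coord_cartesian(xlim = as.Date(c("2014-04-01", "2014-05-19")))

# Total number of cases counted in the bins of each plot
n_limits <- sum(layer_data(plot_limits)$count)
n_zoom   <- sum(layer_data(plot_zoom)$count)

n_limits
n_zoom
```

With limits, ggplot2 also warns that rows were removed; the zoomed plot keeps all eight cases in its statistics even though only part of the date range is shown.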
DANGER: Be cautious when setting static y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). If the data change, static numbers can leave the upper part of the axis without breaks or, if they are also used as limits, cut off your data!
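One way to avoid static numbers is to derive the breaks from the data. The helper below is a hypothetical sketch in base R (count_breaks() is not a function from ggplot2 or from this handbook): it extends the breaks to the first multiple of the step at or above the largest count.

```r
# Hypothetical helper: breaks from 0 up to the first multiple of `by`
# at or above the maximum count
count_breaks <- function(counts, by = 5) {
  top <- ceiling(max(counts, na.rm = TRUE) / by) * by
  seq(0, top, by = by)
}

count_breaks(c(3, 12, 27))   # breaks run from 0 to 30
count_breaks(c(3, 12, 41))   # breaks extend to 45 once the data grow
```

The resulting vector could be supplied to breaks = within scale_y_continuous(), using the bar heights (e.g. weekly counts) as input.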
See https://rdrr.io/r/base/strptime.html for all % shortcuts.
Below is a demonstration of some plots where the bins and the plot labels/gridlines are aligned and not aligned:
# 7-day binwidth defaults
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
geom_histogram(
binwidth = 7, # 7 days per bin (! starts at first case!)
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of bar fill
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\ndefault axis labels/ticks not aligned")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# 7-day bins + Monday labels
#############################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
binwidth = 7, # 7-day bins with start at first case
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # Monday every 3 weeks
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
scale_y_continuous(
expand = c(0,0))+ # remove excess space under x-axis, make flush with labels
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nDate labels and gridlines on Mondays")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# 7-day bins + Months
#####################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
binwidth = 7,
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "months", # 1st of month
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
scale_y_continuous(
expand = c(0,0))+ # remove excess space under x-axis, make flush with labels
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nGridlines at 1st of each month (with labels) and weekly on Mondays\nLabels on 1st of each month")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
# histogram breaks set to 7 days beginning Monday before first case
breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # Monday every 3 weeks
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
labs(
title = "ALIGNED Mondays",
subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
# histogram breaks set to 7 days beginning Sunday before first case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0),
# date label breaks set to every 3 weeks beginning Sunday before first case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
# gridlines set to weekly beginning Sunday before first case
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
date_labels = "%d\n%b\n'%y")+ # label format
labs(title = "ALIGNED Sundays",
subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")
## Warning: Removed 26 rows containing non-finite values (stat_bin).
# Check values of bars by creating dataframe of grouped values
# central_tab <- central_data %>%
# mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
# group_by(week, .drop=F) %>%
# summarize(n = n()) %>%
# mutate(groups_3wk = 1:(nrow(central_tab)+1) %/% 3) %>%
# group_by(groups_3wk) %>%
# summarize(n = n())
Designate a column containing groups
In either of the code templates (Sunday weeks, Monday weeks), make the following changes:
Add an aes() within the geom_histogram() (don’t forget the comma afterward)
Within aes(), provide the grouping column name to group = and fill = (no quotes needed). group is necessary, while fill changes the color of the bar.
Remove any fill = argument outside of the aes(), as it will override the one inside
Arguments inside aes() will apply by group, whereas any outside will apply to all bars (e.g. you may want color = outside, so each bar has the same color perimeter/border)
geom_histogram(
aes(group = gender, fill = gender))
Adjust colors:
Use scale_fill_manual() (note scale_color_manual() is different!).
Use the values = argument to apply a vector of colors.
Use na.value = to specify a color for missing values.
Be careful when using the labels = argument in scale_fill_manual() to change the legend text labels: it is easy to accidentally give labels in the incorrect order and end up with an incorrect legend! It is recommended to instead convert the group column to class Factor and designate factor labels and order, as explained below.
Adjust the stacking order and Legend
Stacking order, and the labels for each group in the legend, are best adjusted by converting the group column to class Factor. You can then designate the levels, their labels, and their order (which is reflected in the stack order).
Step 1: Before making the ggplot, convert the group column to class Factor using factor() from base R.
For example, with a column “gender” with values “m” and “f” and NA, this can be put in a mutate() command as:
dataset <- dataset %>%
mutate(gender = factor(gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
The above code establishes the levels, ordered so that missing values come “first” (and will appear on top of the stack). The labels to display are then given in the same order. Lastly, exclude = NULL ensures that NA is included in the ordering (by default factor() ignores NA).
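To see what this conversion does, it can be tried on a small vector first. The values below are hypothetical; only factor() from base R is used.

```r
# Hypothetical gender values, including a missing value
gender <- c("m", "f", NA, "f")

gender_fct <- factor(
  gender,
  levels  = c(NA, "f", "m"),                 # NA listed first
  labels  = c("Missing", "Female", "Male"),  # same order as the levels
  exclude = NULL)                            # keep NA as a level

levels(gender_fct)   # level order, which determines stacking and legend order
table(gender_fct)    # the missing value is now counted under "Missing"
```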
Read more about factors in their dedicated handbook page (LINK).
Adjusting the legend
Read more about legends in the ggplot tips page. Here are a few highlights:
theme(legend.position = "top") (or “bottom”, “left”, “right”)
theme(legend.direction = "horizontal")
theme(legend.title = element_blank()) to have no title
See the page of ggplot() tips for more details on legends.
These steps are shown in the example below:
########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days") # bins are 7-days
# Set gender as factor and missing values as first level (to show on top)
central_data <- linelist %>%
filter(hospital == "Central Hospital") %>%
mutate(gender = factor(
gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
# make plot
###########
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
aes(group = gender, fill = gender), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# The labels on the x-axis
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(
expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
#scale of colors and legend labels
scale_fill_manual(
values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(
plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(
title = "Weekly incidence of cases, by gender",
subtitle = "Subtitle",
fill = "Gender", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Display bars side-by-side
Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge".
If there are more than two value groups, these can become difficult to read. Consider instead using a faceted plot (small multiples) (see the section on facets). To improve readability in this example, missing gender values are removed.
########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days") # bins are 7-days
# New dataset without rows missing gender
central_data_dodge <- linelist %>%
filter(hospital == "Central Hospital") %>%
filter(!is.na(gender)) %>% # remove rows missing gender
mutate(gender = factor(gender, # factor now has only two levels (missing not included)
levels = c("f", "m"),
labels = c("Female", "Male")))
# make plot
###########
ggplot(central_data_dodge, aes(x = date_onset)) +
geom_histogram(
aes(group = gender, fill = gender), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks, # see breaks defined above
position = "dodge")+ # display the group bars side-by-side
# The labels on the x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
#scale of colors and legend labels
scale_fill_manual(values = c("pink", "lightblue"))+ # specify fill colors ("values") - attention to order!
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases, by gender",
subtitle = "Subtitle",
fill = "Gender", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data_dodge)} from Central Hospital; Case onsets range from {format(min(central_data_dodge$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data_dodge$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data_dodge %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 24 rows containing non-finite values (stat_bin).
As with other ggplots, you can create faceted plots (“small multiples”) based on the values in a column. As explained in the ggplot tips page of this handbook, you can use either:
facet_wrap()
facet_grid()
For epicurves, facet_wrap() is typically easiest as it is likely that you only need to facet on one column. The rows ~ cols formula syntax belongs to facet_grid(): to the left of the tilde (~) is the name of a column to be spread across the “rows” of the new plot, and to the right of the tilde is the name of a column to be spread across the “columns” of the new plot.
For facet_wrap(), most simply, just use one column name to the right of the tilde: facet_wrap(~age_cat).
Free axes
You will need to decide whether the scales (scales =) of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). You can also specify “free_x” or “free_y” to release in only one dimension.
Number of cols and rows
This can be specified with ncol = and nrow = within facet_wrap().
Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.
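A brief base R illustration of setting the level order, using hypothetical age categories. Left to its default, factor() sorts the values as text, which is rarely the order wanted for age groups.

```r
# Hypothetical age categories as plain character values
age_cat <- c("5-9", "0-4", "10-14", "0-4", "70+")

# Without explicit levels, factor() sorts the values as text
levels(factor(age_cat))

# Giving the levels explicitly fixes the order in which facet panels would appear
age_fct <- factor(age_cat, levels = c("0-4", "5-9", "10-14", "70+"))
levels(age_fct)
```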
Aesthetics
Font size and face, strip color, etc. can be modified through theme() with arguments like:
strip.text = element_text() (size, colour, face, angle…)
strip.background = element_rect() (e.g. element_rect(fill = "red"))
The position of the strip can be modified with the strip.position = argument within facet_wrap() (e.g. “bottom”, “top”, “left”, “right”)
Strip labels
Labels of the facet plots can be modified through the “labels” of the column as a factor, or by the use of a “labeller”.
Make a labeller like this, using the function as_labeller() from ggplot2:
my_labels <- as_labeller(c(
"0-4" = "Ages 0-4",
"5-9" = "Ages 5-9",
"10-14" = "Ages 10-14",
"15-19" = "Ages 15-19",
"20-29" = "Ages 20-29",
"30-49" = "Ages 30-49",
"50-69" = "Ages 50-69",
"70+" = "Over age 70"))
An example plot
Faceted by column age_cat.
# make plot
###########
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# The labels on the x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"), # axis titles in bold
legend.position = "bottom",
strip.text = element_text(face = "bold", size = 10),
strip.background = element_rect(fill = "grey"))+ # grey background for facet strips
# create facets
facet_wrap(~age_cat,
ncol = 4,
strip.position = "top",
labeller = my_labels)+
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 26 rows containing non-finite values (stat_bin).
See this link for more information on labellers.
Add total epidemic to background
Add a separate geom_histogram() command before the current one. Specify that the data used is the data without the column used for faceting (see select()). Then, specify a color like “grey” and a degree of transparency to make it appear in the background.
geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+
Note that the y-axis maximum is now based on the height of the entire epidemic.
ggplot(central_data, aes(x = date_onset)) +
# for background shadow of whole outbreak
geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+
# actual epicurves by group
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# Labels on x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"), # axis titles in bold
legend.position = "bottom",
strip.text = element_text(face = "bold", size = 10),
strip.background = element_rect(fill = "white"))+ # white background for facet strips
# create facets
facet_wrap(~age_cat, # each plot is one value of age_cat
ncol = 4, # number of columns
strip.position = "top", # position of the facet title/strip
labeller = my_labels)+ # labeller defines above
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 208 rows containing non-finite values (stat_bin).
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Create one facet with ALL data
To do this, duplicate all the data (doubling the number of rows in the dataset) and, in the column used for faceting, give the duplicated rows a new value (e.g. “all”). A helper function that does this is below:
# Define helper function
CreateAllFacet <- function(df, col){
df$facet <- df[[col]]
temp <- df
temp$facet <- "all"
merged <- rbind(temp, df)
# ensure the original grouping column is a factor
merged[[col]] <- as.factor(merged[[col]])
return(merged)
}
# Create dataset that is duplicated, to show "all" as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
mutate(facet = factor(facet,
levels = c("all", "0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70+")))
# check
table(central_data2$facet, useNA = "always")
##
##   all   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA>
##   454    92    55    82    62    79    64    12     0     8
Notable changes to the ggplot command are:
facet_wrap(facet~.), and ncol = 1
You may also need to adjust the width and height of the saved plot image (see ggsave()).
ggplot(central_data2, aes(x = date_onset)) +
# actual epicurves by group
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# Labels on x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"),
legend.position = "bottom")+
# create facets
facet_wrap(facet~. , # each plot is one value of facet
ncol = 1)+
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## Warning: Removed 52 rows containing non-finite values (stat_bin).
Add a moving average to a ggplot() epicurve in one of two ways:
Pre-calculate the moving average in the dataset and plot it with geom_line()
Calculate the moving average on-the-fly, within the ggplot() command
In the first approach, the moving average is calculated in the dataset prior to plotting:
Within mutate(), a new column is created to hold the average. slide_index() from the slider package is used as shown below.
In ggplot(), a geom_line() is added after the histogram, reflecting the moving average.
See the helpful online vignette for the slider package
pacman::p_load(slider) # slider used to calculate rolling averages
# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>%
## count cases by date
count(date_onset,
name = "new_cases") %>% # name of new column
filter(!is.na(date_onset)) %>% # remove cases with missing date_onset
## calculate the average number of cases in the preceding 7 days
mutate(
avg_7day = slider::slide_index( # create new column
new_cases, # calculate based on value in new_cases column
.i = date_onset, # index is date_onset col, so non-present dates are included in window
.f = ~mean(.x, na.rm = TRUE), # function is mean() with missing values removed
.before = 6, # window is the day and 6-days before
.complete = FALSE), # must be FALSE for unlist() to work in next step
avg_7day = unlist(avg_7day))
# plot
######
ggplot(data = ll_counts_7day, aes(x = date_onset)) +
geom_histogram(aes(y = new_cases),
fill="#92a8d1",
stat = "identity",
position = "stack",
colour = "#92a8d1")+
geom_line(aes(y = avg_7day, lty = "7-day \nrolling avg"),
color="red",
size = 1) +
scale_x_date(date_breaks = "1 month",
date_labels = '%d/%m',
expand = c(0,0)) +
scale_y_continuous(expand = c(0,0),
limits = c(0, NA)) +
labs(x="",
y ="Number of confirmed cases",
fill = "Legend")+
theme_minimal()+
theme(legend.title = element_blank()) # removes title of legend
## Warning: Ignoring unknown parameters: binwidth, bins, pad
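The logic of a date-indexed window can be checked on a few rows with base R alone. This is a hand-rolled sketch, not the slider implementation: the daily data frame is hypothetical, and each average is taken over the rows dated within the current day and the 6 days before it. Dates that have no row contribute nothing to the average (they are not treated as zero).

```r
# Hypothetical daily counts; note that no rows exist for 4-9 May
daily <- data.frame(
  date_onset = as.Date("2014-05-01") + c(0, 1, 2, 9),
  new_cases  = c(2, 4, 6, 8))

# For each row, average the counts of rows dated within that day and the 6 days before
daily$avg_7day <- vapply(
  seq_len(nrow(daily)),
  function(i) {
    d <- daily$date_onset[i]
    in_window <- daily$date_onset > d - 7 & daily$date_onset <= d
    mean(daily$new_cases[in_window], na.rm = TRUE)
  },
  numeric(1))

daily   # the last row averages only itself, because the earlier rows fall outside its window
```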
Using the tidyquant package to calculate the moving average on-the-fly (within ggplot()).
This option is more difficult to modify than pre-calculating the moving average. By default, geom_ma() uses the Simple Moving Average (SMA) (TTR::SMA()), which calculates the arithmetic mean over the past n observations. See the documentation by entering ?SMA in your R console. Also note how the moving average does not begin as early as in the previous example.
library(tidyquant)
# make daily count data
#######################
ll_counts_7day <- linelist %>%
count(date_onset, name = "daily_cases")
# plot
######
ggplot(data = ll_counts_7day, # use daily count data
aes(x = date_onset, # date x-axis
y = daily_cases))+ # counts
# histogram in the background
geom_histogram(stat = "identity", # height = value in the cell, not number of rows
color = "#92a8d1", # color of lines within histogram
fill = "#92a8d1")+ # color of histogram
# moving average line
tidyquant::geom_ma(n = 7, # window width
size = 2, # line size
color = "black", # line color
lty = "solid" # line type ()
)+
# labels for x-axis
scale_x_date(date_breaks = "2 months", # labels every 2 months
date_minor_breaks = "1 month", # gridlines every month
date_labels = '%b\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel2")+
theme_minimal()+
labs(x = "Date of onset",
y = "Daily case incidence",
title = "Daily case incidence, with 7-day moving average")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Warning: Removed 1 rows containing missing values (position_stack).
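Why an observation-based moving average starts later can be seen with a short base R sketch. It uses stats::filter() as a stand-in for a simple moving average (it is not the tidyquant or TTR code), with hypothetical counts and a window of n = 3 observations.

```r
# Hypothetical daily counts
daily_cases <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
n <- 3   # window width, in observations

# Trailing mean of the current value and the n - 1 values before it
sma <- as.numeric(stats::filter(daily_cases, rep(1 / n, n), sides = 1))

sma   # the first n - 1 values are NA because the window is not yet complete
```

Note also that a window counted in observations, unlike a window indexed by date, does not account for dates that are absent from the data.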
The most recent data shown in epicurves should often be marked as tentative, or subject to reporting delays. This can be done by adding a vertical line and/or rectangle over a specified number of days. Here are two options:
Use annotate():
For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
Use geom_segment() and geom_rect():
CAUTION: While you can use geom_rect() to draw a rectangle, adjusting the transparency (alpha) does not work as expected in a linelist context, because this function overlays one rectangle for each observation/row! Try a very low alpha (e.g. 0.01), or use annotate(geom = "rect") as shown.
annotate()
In annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
annotate() online example
ggplot(central_data, aes(x = date_onset)) +
# histogram
geom_histogram(
breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
# scales
scale_y_continuous(expand = c(0,0))+
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "1 month", # 1st of month
date_minor_breaks = "1 month", # 1st of month
date_labels = "%b\n'%y")+ # label format
# labels and theme
labs(title = "Using annotate()\nRectangle and line showing that data from last 21-days are tentative",
x = "Week of symptom onset",
y = "Weekly case incidence")+
theme_minimal()+
# add semi-transparent red rectangle to tentative data
annotate("rect",
xmin = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
xmax = as.Date(Inf), # note must be wrapped in as.Date()
ymin = 0,
ymax = Inf,
alpha = 0.2, # alpha easy and intuitive to adjust using annotate()
fill = "red")+
# add black vertical line on top of other layers
annotate("segment",
x = max(central_data$date_onset, na.rm = T) - 21, # 21 days before last data
xend = max(central_data$date_onset, na.rm = T) - 21,
y = 0, # line begins at y = 0
yend = Inf, # line to top of plot
size = 2, # line size
color = "black",
lty = "solid")+ # linetype e.g. "solid", "dashed"
# add text in rectangle
annotate("text",
x = max(central_data$date_onset, na.rm = T) - 15,
y = 20,
label = "Subject to reporting delays",
angle = 90)
## Warning: Removed 26 rows containing non-finite values (stat_bin).
The same black vertical line can be achieved with the code below, but using geom_vline() you lose the ability to control the height:
geom_vline(xintercept = max(central_data$date_onset, na.rm = T) - 21,
size = 2,
color = "black")
geom_segment() and geom_rect()
ggplot(central_data, aes(x = date_onset)) +
# histogram
geom_histogram(
breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
# scales
scale_y_continuous(expand = c(0,0))+
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # Monday every 3 weeks
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
# labels and theme
labs(title = "Using geom_segment() and geom_rect()\nRectangle and line showing that data from last 21-days are tentative",
subtitle = "")+
theme_minimal()+
# make rectangle covering last 21 days
geom_rect(aes(
xmin = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
xmax = as.Date(Inf), # note must be wrapped in as.Date()
ymin = 0,
ymax = Inf,
color = "Reporting delays\npossible"), # sets label for legend (note: is within aes())
alpha = .002, # !!! Difficult to adjust transparency with this option
fill = "red")+
# make vertical line
geom_segment(aes(x = max(central_data$date_onset, na.rm = T) - 21,
xend = max(central_data$date_onset, na.rm = T) - 21,
y = 0,
yend = Inf),
color = "black",
lty = "solid",
size = 2)+
theme(legend.title = element_blank()) # remove title of legend
## Warning: Use of `central_data$date_onset` is discouraged. Use `date_onset` instead.
## Warning: Use of `central_data$date_onset` is discouraged. Use `date_onset` instead.
## Warning: Use of `central_data$date_onset` is discouraged. Use `date_onset` instead.
## Warning: Removed 26 rows containing non-finite values (stat_bin).
Two axes - TBD
Here is an option if you want multi-level date labels, without duplicating the lower label levels (e.g. for year or month).
Remember, you can use tools like \n within the date_labels or labels arguments to put parts of each label on a new line below. However, the code below places years or months (for example) on a lower line and shows each only once.
A few notes on the code below:
Aggregate the weekly counts
# Create dataset of case counts by week
#######################################
central_weekly <- linelist %>%
filter(hospital == "Central Hospital") %>% # filter linelist
mutate(week = lubridate::floor_date(date_onset, unit = "weeks")) %>%
count(week, .drop=F) %>% # summarize weekly case counts
filter(!is.na(week)) %>% # remove cases with missing date_onset
complete(week = seq.Date(from = min(week), # fill-in all weeks with no cases reported
to = max(week),
by = "week"))
Make plots
# plot
######
ggplot(central_weekly) +
geom_line(aes(x = week, y = n), # make line, specify x and y
stat = "identity") + # because line height is count number
scale_x_date(date_labels="%b", # date label format show month
date_breaks="month", # date labels on 1st of each month
expand=c(0,0)) + # remove excess space
facet_grid(~lubridate::year(week), # facet on year (of Date class column)
space="free_x",
scales="free_x", # x-axes adapt to data range (not "fixed")
switch="x") + # facet labels (year) on bottom
theme_bw() +
theme(strip.placement = "outside", # facet labels placement
strip.background = element_rect(fill = NA, # facet labels no fill grey border
colour = "grey50"),
panel.spacing = unit(0, "cm"))+ # no space between facet panels
labs(title = "Nested year labels, grey label border")
# plot no border
################
ggplot(central_weekly,
aes(x = week, y = n)) + # establish x and y for entire plot
geom_line(stat = "identity", # make line, line height is count number
color = "#69b3a2") + # line color
geom_point(size=1, color="#69b3a2") + # make points at the weekly data points
geom_area(fill = "#69b3a2", # fill area below line
alpha = 0.4)+ # fill transparency
scale_x_date(date_labels="%b", # date label format show month
date_breaks="month", # date labels on 1st of each month
expand=c(0,0)) + # remove excess space
facet_grid(~lubridate::year(week), # facet on year (of Date class column)
space="free_x",
scales="free_x", # x-axes adapt to data range (not "fixed")
switch="x") + # facet labels (year) on bottom
theme_bw() +
theme(strip.placement = "outside", # facet label placement
strip.background = element_blank(), # no facet label background
panel.grid.minor.x = element_blank(),
panel.border = element_rect(colour="grey40"), # grey border to facet PANEL
panel.spacing=unit(0,"cm"))+ # No space between facet panels
labs(title = "Nested year labels - points, shaded, no label border")
## Warning: Removed 3 rows containing missing values (position_stack).
## Warning: Removed 3 rows containing missing values (geom_point).
The above techniques were adapted from this and this post on stackoverflow.com.
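For comparison, the simpler \n approach mentioned at the start of this section can be sketched as follows. This is a minimal illustration on an invented dataset (toy_weekly and its values are assumptions, not part of the linelist). Unlike the faceted approach, it repeats the year under every month label.

```r
library(ggplot2)

# Hypothetical weekly counts, invented for illustration only
toy_weekly <- data.frame(
  week = seq.Date(as.Date("2014-11-03"), by = "week", length.out = 16),
  n    = c(3, 5, 8, 12, 9, 7, 6, 4, 6, 9, 11, 8, 5, 3, 2, 1)
)

label_plot <- ggplot(toy_weekly, aes(x = week, y = n)) +
  geom_line() +
  scale_x_date(date_breaks = "1 month",     # one label per month
               date_labels = "%b\n%Y") +    # \n places the year on a second line
  labs(title = "Two-line date labels")

label_plot
```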
To learn generally how to group and aggregate data, see the handbook page on Grouping/Aggregating.
In this circumstance, we demonstrate aggregating into weeks, months, and days.
Create a new column of weeks, then use count() (or group_by() with summarize()) to get weekly case counts.
To aggregate into weeks and show ALL weeks (even ones with no cases), do this:
Create the week column with mutate(), using floor_date() from the lubridate package:
unit = to set the desired time unit, e.g. "week"
week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
Then use complete() to ensure that all weeks appear - even those with no cases.
For example:
# Make dataset of weekly case counts
weekly_counts <- linelist %>%
mutate(
week = lubridate::floor_date(date_onset,
unit = "week")) %>% # new column of week of onset
count(week) %>% # group data by week and count rows per group
filter(!is.na(week)) %>% # remove entries for cases missing date_onset
complete(week = seq.Date(from = min(week), # fill-in all weeks with no cases reported
to = max(week),
by="week")) %>%
ungroup() # deactivate grouping
Here are the first 50 rows of the resulting dataframe:
Alternatively, you can use the aweek package’s date2week() function. As shown below, set week_start = to “Sunday”, or “Monday”, etc. Set floor_day = TRUE so the output is YYYY-Www. Set factor = TRUE so that all possible weeks are included, even if there are no cases (this replaces the complete() step in the lubridate approach above). You can also use numeric = TRUE if you want only the week number (note this will not distinguish between years).
# Make dataset of weekly case counts
weekly_counts <- linelist %>%
mutate(week = aweek::date2week(date_onset, # new column of week of onset
floor_day = T, # show as weeks without weekday
factor = TRUE)) %>% # include all possible weeks
count(week) %>%
ungroup() # deactivate grouping
# Optional: add column of start DATE for each week - e.g. for ggplot() when date x-axis is expected
# note: add this step AFTER the above code, to ensure all weeks are present
weekly_counts <- weekly_counts %>%
mutate(week_as_date = aweek::week2date(week, week_start = "Monday")) # output is Monday date of each week
To aggregate cases into months, again use floor_date() from the lubridate package, but with the argument unit = "months". This rounds each date down to the 1st of its month. The output will be class Date.
Note that in the complete() step we also use “months”
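To see what floor_date() returns for different units, here is a minimal sketch using three invented dates (the onset vector is hypothetical). It also shows the effect of week_start =, which the lubridate example above leaves at its default.

```r
library(lubridate)

# Three invented onset dates: a Wednesday, a Sunday, and a Monday
onset <- as.Date(c("2014-05-07", "2014-05-11", "2014-06-30"))

# Weeks starting on Monday (week_start = 1)
week_mon <- floor_date(onset, unit = "week", week_start = 1)

# Weeks starting on Sunday (week_start = 7)
week_sun <- floor_date(onset, unit = "week", week_start = 7)

# Months: every date is rounded down to the 1st of its month
month_first <- floor_date(onset, unit = "month")

data.frame(onset, week_mon, week_sun, month_first)
```

Note how the Sunday date falls in different weeks depending on week_start =, so it is worth setting this argument explicitly when your reporting weeks follow a particular convention.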
# Make dataset of monthly case counts
monthly_counts <- linelist %>%
mutate(month = lubridate::floor_date(date_onset, unit = "months")) %>% # new column, 1st of month of onset
count(month) %>%
filter(!is.na(month)) %>%
complete(month = seq.Date(min(month), # fill-in all months with no cases reported
max(month),
by="month"))
To aggregate a linelist into days, use the same approach but there is no need to create a new column. Use count() directly on the date column (e.g. date_onset).
If plotting a histogram, missing days in the data are not a problem as long as the column is class Date. However, for other types of plots or tables it may be important that all possible days appear in the data. This is done with tidyr::complete().
# Make dataset of daily case counts
daily_counts <- linelist %>%
count(date_onset) %>% # count number of rows per unique date
filter(!is.na(date_onset)) %>% # remove aggregation of rows that were missing date_onset
complete(date_onset = seq.Date(min(date_onset), # ensure all days appear
max(date_onset),
by="day"))
Often instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can make an epicurve with ggplot(), but the code will be slightly different. The incidence package does not support aggregate data.
This section will utilize the count_data dataset that was imported earlier, in the data preparation section. It is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.
As before, we must ensure date variables are correctly classified.
# Check that the date column is class Date
class(count_data$date_hospitalisation)
## [1] "Date"
We can plot a daily epicurve from these data. Here are the differences:
Set y = to the counts column within the primary aesthetics aes()
stat = "identity" within geom_histogram() indicates that the bar heights should be the counts from the y = column in aes()
ggplot(data = count_data, aes(x = as.Date(date_hospitalisation), y = n_cases))+
geom_histogram(stat = "identity")+
labs(x = "Date of hospitalisation",
y = "Number of cases",
title = "Daily case incidence, from daily count data")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
aggregate further
To aggregate further, into weeks, we use the lubridate package and its floor_date() function, as described above. Note that we use group_by() and summarize() in place of count() because we need to sum() case counts instead of just counting the number of rows per group.
# Create weekly dataset with epiweek column
count_data_weekly <- count_data %>%
mutate(epiweek = lubridate::floor_date(date_hospitalisation, "week")) %>%
group_by(hospital, epiweek, .drop=F) %>%
summarize(n_cases_weekly = sum(n_cases, na.rm=T))
## `summarise()` has grouped output by 'hospital'. You can override using the `.groups` argument.
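To see why sum() is needed rather than a plain count(), here is a small sketch on invented data (toy_counts is hypothetical). It also shows two optional variations that are assumptions on our part rather than the approach used above: the .groups = "drop" argument, which silences the message about grouped output, and count() with wt =, which sums a column instead of counting rows.

```r
library(dplyr)

# Hypothetical aggregated data: one row per hospital-day, with a case count
toy_counts <- data.frame(
  hospital = c("A", "A", "A", "B", "B"),
  epiweek  = as.Date(c("2014-05-05", "2014-05-05", "2014-05-12",
                       "2014-05-05", "2014-05-05")),
  n_cases  = c(2, 3, 1, 4, 0)
)

# count() alone returns the number of ROWS per group, not the number of cases
rows_per_week <- toy_counts %>%
  count(hospital, epiweek)

# group_by() + summarise() with sum() adds up the case counts in each group
cases_per_week <- toy_counts %>%
  group_by(hospital, epiweek) %>%
  summarise(n_cases_weekly = sum(n_cases, na.rm = TRUE),
            .groups = "drop")          # return an ungrouped data frame

# count() with wt = gives the same totals in a single step
cases_per_week_wt <- toy_counts %>%
  count(hospital, epiweek, wt = n_cases, name = "n_cases_weekly")
```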
The first 50 rows of count_data_weekly are displayed below.
For the plotting we also specify the factor level order of hospital.
count_data_weekly <- count_data_weekly %>%
mutate(hospital = factor(hospital,
levels = c("Missing", "Port Hospital", "Military Hospital", "Central Hospital", "St. Mark's Maternity Hospital (SMMH)", "Other")))
Now plot by epiweek.
ggplot(data = count_data_weekly,
aes(x = epiweek,
y = n_cases_weekly,
group = hospital,
fill = hospital))+
geom_histogram(stat = "identity")+
# labels for x-axis
scale_x_date(date_breaks = "2 months", # labels every 2 months
date_minor_breaks = "1 month", # gridlines every month
date_labels = '%b\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel2")+
theme_minimal()+
labs(x = "Week of onset",
y = "Weekly case incidence",
fill = "Hospital",
title = "Weekly case incidence, from aggregated count data by hospital")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.
In ggplot it is difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.
See the handbook page on ggplot tips for details on how to make a second axis.
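As a rough illustration of the idea (a sketch under stated assumptions, not the method from the ggplot tips page), ggplot2's sec_axis() can draw a second axis that is a fixed transformation of the first. The toy data frame, its values, and the choice of scale_factor below are all invented for this example.

```r
library(ggplot2)

# Hypothetical weekly cases and deaths
toy <- data.frame(
  week   = seq.Date(as.Date("2014-05-05"), by = "week", length.out = 6),
  cases  = c(10, 20, 40, 30, 20, 10),
  deaths = c(5, 8, 20, 12, 6, 2)
)
toy$pct_died <- 100 * toy$deaths / toy$cases   # percent of weekly cases who died

# Arbitrary link between the two axes: 100% is drawn at the height of the tallest bar
scale_factor <- max(toy$cases) / 100

dual_axis_plot <- ggplot(toy, aes(x = week)) +
  geom_col(aes(y = cases), fill = "grey70") +               # bars on the primary axis
  geom_line(aes(y = pct_died * scale_factor),               # line rescaled to the primary axis
            color = "darkred") +
  scale_y_continuous(
    name = "Weekly cases",
    sec.axis = sec_axis(~ . / scale_factor,                 # undo the rescaling for the labels
                        name = "Percent of cases who died"))

dual_axis_plot
```

The line is always drawn on the primary scale; the second axis only relabels it. This is part of why such plots are easy to misread, so the scaling factor should be chosen and reported with care.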
If beginning with a case linelist, create a new column containing the cumulative number of cases per day in an outbreak using cumsum() from base R:
cumulative_case_counts <- linelist %>%
count(date_onset) %>% # count of rows per day
mutate(
cumulative_cases = cumsum(n) # new column of the cumulative number of rows at each date
)
The first 10 rows are shown below:
head(cumulative_case_counts, 10)
## date_onset n cumulative_cases
## 1 2014-04-07 1 1
## 2 2014-04-15 1 2
## 3 2014-04-21 2 4
## 4 2014-04-25 1 5
## 5 2014-04-26 1 6
## 6 2014-04-27 1 7
## 7 2014-05-01 2 9
## 8 2014-05-03 1 10
## 9 2014-05-04 1 11
## 10 2014-05-05 1 12
This cumulative column can then be plotted:
plot_cumulative <- ggplot()+
geom_line(
data = cumulative_case_counts,
aes(x = date_onset, y = cumulative_cases),
size = 2,
color = "blue")
plot_cumulative
## Warning: Removed 1 row(s) containing missing values (geom_path).
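The warning about a removed row most likely arises because count() keeps a group for rows with a missing date_onset. That row cannot be placed on the x-axis, yet its count is still added to the final cumulative total. If you prefer to exclude such rows, one option is to filter them out before counting, as in this minimal sketch on an invented linelist (toy_linelist is hypothetical):

```r
library(dplyr)

# Hypothetical linelist with one missing onset date
toy_linelist <- data.frame(
  date_onset = as.Date(c("2014-04-07", "2014-04-15", "2014-04-21", "2014-04-21", NA))
)

toy_cumulative <- toy_linelist %>%
  filter(!is.na(date_onset)) %>%            # drop rows with missing onset date
  count(date_onset) %>%                     # count of rows per day
  mutate(cumulative_cases = cumsum(n))      # running total across days

toy_cumulative
```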
It can also be overlaid onto the epicurve, with dual-axis:
pacman::p_load(cowplot)
plot_cases <- ggplot()+
geom_histogram(
data = linelist,
aes(x = date_onset),
binwidth = 1)+
labs(
y = "Daily cases",
x = "Date of symptom onset"
)+
theme_cowplot()
plot_cumulative <- ggplot()+
geom_line(
data = cumulative_case_counts,
aes(x = date_onset, y = cumulative_cases),
size = 2,
color = "blue")+
scale_y_continuous(
position = "right")+
labs(x = "",
y = "Cumulative cases")+
theme_cowplot()+
theme(
axis.line.x = element_blank(),
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks = element_blank())
aligned_plots <- align_plots(plot_cases, plot_cumulative, align="hv", axis="tblr")
## Warning: Removed 241 rows containing non-finite values (stat_bin).
## Warning: Removed 1 row(s) containing missing values (geom_path).
ggdraw(aligned_plots[[1]]) + draw_plot(aligned_plots[[2]])
Links to other online tutorials or resources.
For appropriate plotting of continuous data, e.g. age, clinical measurements, distance, etc.
As usual, R has built-in functions for quick visualisations, such as the boxplot() or plot functions. You can opt to install additional packages with more functionality - this is often recommended for presentation-ready visualisations. For this we recommend ggplot2.
Visualisations covered here include:
Plots for one continuous variable:
Scatter plots for two continuous variables.
Preparation includes loading the relevant packages, namely ggplot2, (install.packages("ggplot2") if needed), and ensuring your data is the correct class and format. For the examples in this section, we use the simulated Ebola linelist, focusing on the continuous variables age, ct_blood (CT values), and days_onset_hosp (difference between onset date and hospitalisation).
library(ggplot2)
library(dplyr)
linelist <- linelist %>%
mutate(age = as.numeric(age)) # Converting age to numeric value if needed
Plotting one continuous variable
The in-built graphics package comes with the boxplot() function, allowing straight-forward visualisation of a continuous variable for the whole dataset (A below) or within different groups (B and C below). Note how with C, outcome and gender are written as outcome*gender such that the boxplots are for the four combinations of the two columns.
# For total population
graphics::boxplot(linelist$age,
main = "A) One boxplot() for total dataset") # Plot title
# By subgroup
graphics::boxplot(age ~ outcome,
data = linelist, # Here 'data' is specified so no need to write 'linelist$age' in line above.
main = "B) boxplot() by subgroup")
# By crossed subgroups
graphics::boxplot(age ~ outcome*gender,
data = linelist, # Here 'data' is specified so no need to write 'linelist$age' in line above.
main = "C) boxplot() by crossed groups")
Some further options with boxplot() shown below are:
# Varying width by sample size
graphics::boxplot(linelist$age ~ linelist$outcome,
varwidth = TRUE, # width varying by sample size
main="A) Proportional boxplot() widths")
# Notched, with custom colours
boxplot(age ~ outcome,
data=linelist,
notch=TRUE, # notch at median
main="B) Notched boxplot()",
col=(c("gold","darkgreen")),
xlab="Outcome")
# Horizontal
boxplot(age ~ outcome,
data=linelist,
horizontal=TRUE, # flip to horizontal
col=(c("gold","darkgreen")),
main="C) Horizontal boxplot()",
xlab="Age")
Plotting two continuous variables
Using base R, we can visualise the relationship between two continuous variables with the plot function.
We see that higher CT values are associated with a smaller time difference between onset date and hospitalisation. Note that the points look aligned as they are rounded values.
plot(linelist$days_onset_hosp, linelist$ct_blood)
Code syntax
ggplot2 has extensive functionality, and the same code syntax can be used for many different plot types.
A basic breakdown of the ggplot code is as follows:
ggplot(data = linelist)+
geom_XXXX(aes(x = col1, y = col2),
fill = "color")
ggplot() starts off the function. You can specify the data and aesthetics (see next point) within the ggplot bracket, unless you are combining different data sources or plot types into one.
aes() stands for ‘aesthetics’, and is where the columns used for the visualisation are specified. For instance aes(x = col1, y = col2) to specify the data used for the x and y values (where y is the continuous variable in these examples).
fill specifies the colour of the boxplot areas. One could also write color to specify outline or point colour.
geom_XXXX specifies what type of plot. Options include:
geom_boxplot() for a boxplot
geom_violin() for a violin plot
geom_jitter() for a jitter plot
geom_point() for a scatter plot
For more, see the section on ggplot tips.
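The difference between setting fill outside aes() (one fixed colour) and mapping it inside aes() (colour determined by a column) can be sketched on a small invented dataset. The toy data frame below is hypothetical. layer_data() is a ggplot2 helper used here only to inspect what each plot will draw.

```r
library(ggplot2)

# Hypothetical CT values for two outcome groups
toy <- data.frame(
  outcome  = c("Death", "Death", "Death", "Recover", "Recover", "Recover"),
  ct_blood = c(20, 22, 24, 19, 21, 23)
)

# Fixed colour: fill is OUTSIDE aes(), so every box is gold
p_fixed <- ggplot(data = toy) +
  geom_boxplot(aes(x = outcome, y = ct_blood), fill = "gold")

# Mapped colour: fill is INSIDE aes(), so the colour depends on the outcome column
p_mapped <- ggplot(data = toy, aes(x = outcome, y = ct_blood, fill = outcome)) +
  geom_boxplot()

unique(layer_data(p_fixed)$fill)    # a single colour for all boxes
unique(layer_data(p_mapped)$fill)   # one colour per outcome group
```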
Plotting one continuous variable
Below is code for creating box plots, to show the distribution of CT values of Ebola patients in an entire dataset and by subgroup. Note that for the subgroup breakdowns, the ‘NA’ values are also removed using dplyr, otherwise ggplot plots the CT value distribution for ‘NA’ as a separate boxplot.
# A) Simple boxplot of one numeric variable
ggplot(data = linelist, aes(y = ct_blood))+ # only y variable given (no x variable)
geom_boxplot()+
ggtitle("A) Simple ggplot() boxplot")
# B) Box plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)),
aes(y = ct_blood, # numeric variable
x = outcome)) + # group variable
geom_boxplot(fill = "gold")+ # create the boxplot and specify colour
ggtitle("B) ggplot() boxplot by outcome") # main title
Below is code for creating violin plots (geom_violin) and jitter plots (geom_jitter) to show age distributions. One can specify that the ‘fill’ or ‘color’ is also determined by the data, by placing these options within the aes bracket.
# A) Violin plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)),
aes(y = age, # numeric variable
x = outcome, # group variable
fill = outcome))+ # fill variable (color of violins)
geom_violin()+ # create the violin plot
ggtitle("A) ggplot() violin plot by outcome") # main title
## Warning: Removed 71 rows containing non-finite values (stat_ydensity).
# B) Jitter plot by group
ggplot(data = linelist %>% filter(!is.na(outcome)),
aes(y = age, # numeric variable
x = outcome, # group variable
color = outcome))+ # Color variable
geom_jitter()+ # create the jitter plot
ggtitle("B) ggplot() jitter plot by outcome") # main title
## Warning: Removed 71 rows containing missing values (geom_point).
To examine further subgroups, one can ‘facet’ the graph. This means the plot will be recreated within specified subgroups. One can use:
facet_wrap() - this will recreate the sub-graphs and present them alphabetically (typically, unless stated otherwise). You can invoke certain options to determine the look of the facets, e.g. nrow=1 or ncol=1 to control the number of rows or columns that the faceted plots are arranged within. See plot A below.
facet_grid() - this is suited to seeing subgroups for particular combinations of discrete variables. See plot B below.
# A) Facet by one variable
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
aes(y = age, x = outcome, fill=outcome))+
geom_boxplot()+
ggtitle("A) A ggplot() boxplot by gender and outcome")+
facet_wrap(~gender, nrow = 1)
# B) Facet across two variables
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
aes(y = age))+
geom_boxplot()+
ggtitle("B) A ggplot() boxplot by gender and outcome")+
facet_grid(outcome~gender)
To turn the plot horizontal, flip the coordinates with coord_flip().
# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
aes(y = age, x = outcome, fill=outcome))+
geom_boxplot()+
ggtitle("B) A horizontal ggplot() boxplot by gender and outcome")+
facet_wrap(gender~., ncol=1) +
coord_flip()
Plotting two continuous variables
Following similar syntax, geom_point will allow one to plot two continuous variables against each other in a scatter plot. Here we again use facet_grid to show the relationship between two continuous variables in the linelist. We see that higher CT values are associated with a smaller time difference between onset date and hospitalisation.
# By subgroup
ggplot(data = linelist %>% filter(!is.na(gender) & !is.na(outcome)), # filter retains non-missing gender/outcome
aes(y = days_onset_hosp, x = ct_blood))+
geom_point()+
ggtitle("A ggplot() scatter plot by gender and outcome")+
facet_grid(gender~outcome)
## Warning: Removed 181 rows containing missing values (geom_point).
There is a huge amount of help online, especially with ggplot. see:
For appropriate plotting of categorical data, e.g. the distribution of sex, symptoms, ethnic group, etc.
In this section we cover use of R’s built-in functions or functions from the ggplot2 package to visualise discrete/categorical data. The additional functionality of ggplot2 compared to base R means we recommend it for presentation-ready visualisations.
We cover visualising distributions of categorical values, as counts and proportions.
Preparation includes loading the relevant packages, namely ggplot2, (install.packages("ggplot2") if needed), and ensuring your data is the correct class and format.
library(ggplot2)
library(dplyr)
For the examples in this section, we use the simulated Ebola linelist, focusing on the discrete variables hospital, outcome, and gender.
For displaying frequencies, you have the option of creating plots based on:
dplyr to create a table of case counts per hospital.
Tables can be created using the ‘table’ method for built-in graphics
#Table method
outcome_nbar <- table(linelist$outcome)
class(outcome_nbar) # View class of object
## [1] "table"
outcome_nbar # View full table
##
## Death Recover
## 2582 1983
Or using other data management packages such as dplyr
#Dplyr method
outcome_n <- linelist %>%
group_by(outcome) %>%
count()
class(outcome_n) # View class of object
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
outcome_n #View full table
## # A tibble: 3 x 2
## # Groups: outcome [3]
## outcome n
## <chr> <int>
## 1 Death 2582
## 2 Recover 1983
## 3 <NA> 1323
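As the two outputs above suggest, the approaches treat missing values differently: table() drops NA by default, while the dplyr approach keeps NA as its own row. The sketch below illustrates this on a small invented vector (toy is hypothetical) and shows how to switch either behaviour.

```r
library(dplyr)

# Hypothetical outcomes, two of them missing
toy <- data.frame(outcome = c("Death", "Recover", NA, "Death", NA))

# table(): NA is dropped unless useNA = is specified
tab_default <- table(toy$outcome)
tab_with_na <- table(toy$outcome, useNA = "ifany")

# count(): NA is kept as its own row unless filtered out first
cnt_default <- toy %>% count(outcome)
cnt_known   <- toy %>% filter(!is.na(outcome)) %>% count(outcome)
```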
Bar plots
To create bar plots in base R, we create a frequency table using the table function. This creates an object of class table, which R can recognise for plotting. We can create a simple frequency graph showing Ebola case outcomes (A), or add in colours to present outcomes by gender (B).
Note that NA values are excluded from these plots by default.
# A) Outcomes in all cases
outcome_nbar <- table(linelist$outcome)
barplot(outcome_nbar, main= "A) Outcomes")
# B) Outcomes in all cases by gender of case
outcome_nbar2 <- table(linelist$outcome, linelist$gender) # The first variable gives the groupings within a bar, the second gives the separate bars
barplot(outcome_nbar2, legend.text=TRUE, main = "B) Outcomes by gender") # Specify inclusion of legend
Code syntax
ggplot2 has extensive functionality, and the same code syntax can be used for many different plot types.
Similar to the plotting continuous data section, basic breakdown of the ggplot code is as follows:
ggplot(data = linelist)+
geom_XXXX(aes(x = col1, y = col2),
fill = "color")
ggplot() starts off the function. You can specify the data and aesthetics (see next point) within the ggplot bracket, unless you are combining different data sources or plot types into one.
aes() stands for ‘aesthetics’, and is where the columns used for the visualisation are specified. For instance aes(x = col1, y = col2) to specify the data used for the x and y values.
fill specifies the colour of bars, or of the subgroups if specified within the aes bracket.
geom_XXXX specifies what type of plot. Options include:
geom_bar() for a bar chart based on a linelist
geom_col() for a bar chart based on a table with values (see preparation section)
For more, see the section on ggplot tips.
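To make the distinction concrete, the sketch below draws the same bars twice from a small invented dataset (toy is hypothetical): once with geom_bar() on the raw rows, and once with geom_col() on a pre-counted table. layer_data() is used only to confirm that the bar heights agree.

```r
library(ggplot2)
library(dplyr)

# Hypothetical linelist: one row per case
toy <- data.frame(outcome = c("Death", "Death", "Death", "Recover", "Recover"))

# geom_bar(): give only x, ggplot counts the rows for you
p_bar <- ggplot(toy) +
  geom_bar(aes(x = outcome))

# geom_col(): count first, then give both x and y
toy_n <- toy %>% count(outcome)
p_col <- ggplot(toy_n) +
  geom_col(aes(x = outcome, y = n))

layer_data(p_bar)$y   # bar heights computed by geom_bar()
layer_data(p_col)$y   # bar heights taken from column n
```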
Bar charts using raw data
Below is code using geom_bar for creating some simple bar charts to show frequencies of Ebola patient outcomes: A) For all cases, and B) By hospital.
In the aes bracket, only x needs to be specified. Ggplot knows that y will be the number of observations that fall into those categories. Note that a bar is generated for cases with missing outcomes; these may be cases without known outcome or who are still currently sick.
# A) Outcomes in all cases
ggplot(linelist) +
geom_bar(aes(x=outcome)) +
labs(title = "A) Number of recovered and dead Ebola cases")
# B) Outcomes in all cases by hospital
ggplot(linelist %>% filter(!is.na(outcome))) +
geom_bar(aes(x=outcome, fill = hospital)) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "B) Number of recovered and dead Ebola cases, by hospital")
Bar charts using processed data
As above, below is code using geom_col for creating simple bar charts to show frequencies of Ebola patient outcomes: A) For all cases, and B) By hospital. Note that a bar is generated for cases with missing outcomes; these may be cases without known outcome or who are still currently sick. We remove them in graph B.
With geom_col, both x and y need to be specified. Here x is the discrete variable outcome and y is the generated frequencies column n. To create B), an additional table needs to be created for frequencies of the combined categories outcome and hospital.
# A) Outcomes in all cases
ggplot(outcome_n) +
geom_col(aes(x=outcome, y = n)) +
theme_minimal() +
labs(title = "A) Number of recovered and dead Ebola cases")
outcome_n2 <- linelist %>%
group_by(hospital, outcome) %>%
count()
head(outcome_n2) #Preview data
## # A tibble: 6 x 3
## # Groups: hospital, outcome [6]
## hospital outcome n
## <chr> <chr> <int>
## 1 Central Hospital Death 193
## 2 Central Hospital Recover 165
## 3 Central Hospital <NA> 96
## 4 Military Hospital Death 399
## 5 Military Hospital Recover 309
## 6 Military Hospital <NA> 188
# B) Outcomes in all cases by hospital
ggplot(outcome_n2 %>% filter(!is.na(outcome))) + #Remove missing outcomes
geom_col(aes(x=outcome, y = n, fill = hospital)) +
theme_minimal() +
labs(title = "B) Number of recovered and dead Ebola cases, by hospital")
Rather than presenting frequencies, we can also calculate proportions and graph these, as shown in A) below. Here rather than showing the distribution of hospital of admission among those who died and recovered, we show the outcome distribution of patients by hospital.
As shown in B, we can also change the stacked bar plot appearance, so that each subcategory is a separate bar, using position = "dodge". This is sometimes appropriate in that it allows for easier comparison of the height of each category. Both examples below also use coord_flip for horizontal plots.
outcome_n2 <- outcome_n2 %>%
group_by(hospital) %>%
mutate(prop = n/sum(n)) # Calculate proportions
# A) % outcome by hospital
ggplot(outcome_n2) +
geom_col(aes(x=hospital, y = prop, fill = outcome)) +
coord_flip() + # Change the view to horizontal so it is easier to read
labs(title = "A) Proportion of recovered and dead Ebola cases by hospital - option 1")
# B) % outcome by hospital, with bars side by side
ggplot(outcome_n2) +
geom_col(aes(x=hospital, y = prop, fill = outcome), position = "dodge") +
coord_flip() + # Change the view to horizontal so it is easier to read
labs(title = "B) Proportion of recovered and dead Ebola cases, by hospital - option 2")
We can also use faceting to create further mini-graphs, as detailed in the continuous data visualisation section. Specifically, one can use:
facet_wrap() - this will recreate the sub-graphs and present them alphabetically (typically, unless stated otherwise). You can invoke certain options to determine the look of the facets, e.g. nrow=1 or ncol=1 to control the number of rows or columns that the faceted plots are arranged within. See plot A below.
facet_grid() - this is suited to seeing subgroups for particular combinations of discrete variables. See plot B below.
There is a huge amount of help online, especially with ggplot. see:
This section demonstrates how to create publication-ready tables, which can be inserted directly into shareable documents, including R Markdown outputs.
We build on previous sections on basic statistics and creating summary tables (e.g. using dplyr and gtsummary) and show how to create publication-ready tables. The primary package we use is flextable, which is compatible with multiple R Markdown formats, including HTML and Word documents.
Example:
Table of Ebola patients with outcome information: Number, proportion, and CT values of cases who recovered and died
Hospital | Total cases with known outcome | Recovered | Died | ||||
Number | Proportion of cases with outcomes | CT value | Number | Proportion of cases with outcomes | CT value | ||
Port Hospital | 1,364 | 579 | 42.45 | 21.0 | 785 | 57.55 | 22.0 |
Military Hospital | 708 | 309 | 43.64 | 21.0 | 399 | 56.36 | 22.0 |
Other | 685 | 290 | 42.34 | 21.0 | 395 | 57.66 | 22.0 |
Central Hospital | 358 | 165 | 46.09 | 22.0 | 193 | 53.91 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.77 | 21.0 | 199 | 61.23 | 22.0 |
Using packages discussed in other sections such as gtsummary and dplyr, create a table with the content of interest, with the correct columns and rows.
Here we create a simple summary table of patient outcomes using the Ebola linelist. We are interested in knowing the number and proportion of patients who recovered or died, as well as their median CT values, by hospital of admission.
table <- linelist %>%
group_by(hospital, outcome) %>%
filter(!is.na(outcome) & hospital!="Missing") %>% # Remove cases with missing outcome/hospital
summarise(ct_value = median(ct_blood), N = n()) %>% # Calculate indicators of interest
pivot_wider(values_from=c(ct_value, N), names_from = outcome) %>% #Pivot from long to wide
mutate(`N known` = `N_Death` + N_Recover) %>% # Calculate total number
arrange(-`N known`) %>% # Arrange rows from highest to lowest total
mutate(`Prop_Death` = `N_Death`/`N known`*100, # Calculate proportions
`Prop_Recover` = `N_Recover`/`N known`*100) %>%
select(hospital, `N known`, `N_Recover`, `Prop_Recover`, ct_value_Recover,
`N_Death`, `Prop_Death`, ct_value_Death) # Re-order columns
## `summarise()` has grouped output by 'hospital'. You can override using the `.groups` argument.
table
## # A tibble: 5 x 8
## # Groups: hospital [5]
## hospital `N known` N_Recover Prop_Recover ct_value_Recover N_Death Prop_Death ct_value_Death
## <chr> <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Port Hospital 1364 579 42.4 21 785 57.6 22
## 2 Military Hospital 708 309 43.6 21 399 56.4 22
## 3 Other 685 290 42.3 21 395 57.7 22
## 4 Central Hospital 358 165 46.1 22 193 53.9 22
## 5 St. Mark's Maternity Hospital (SMMH) 325 126 38.8 21 199 61.2 22
Load, and install if necessary, flextable, which we will use to convert the above table into a fully formatted and presentable table.
library(flextable)
Creating a flextable
To create and manage flextable objects, we pass the table object through the flextable function and progressively add more formatting and features using the pipe (%>%) syntax familiar from dplyr.
The syntax of each line of flextable code is as follows:
function(table, i = X, j = X, part = "X"), where:
table is the name of the table object, although it does not need to be stated if using the pipe syntax and the table name has already been specified (see examples).
function can be one of many flextable functions, such as width to determine column widths, bg to set background colours, align to set whether text is centre/right/left aligned, and so on.
part refers to which part of the table the function is being applied to. E.g. “header”, “body” or “all”.
i specifies the row to apply the function to, where ‘X’ is the row number. If multiple rows, e.g. the first to third rows, one can specify: i = c(1:3). Note if ‘body’ is selected, the first row starts from underneath the header section.
j specifies the column to apply the function to, where ‘X’ is the column number or name. If multiple columns, e.g. the fifth and sixth, one can specify: j = c(5,6).
ftable <- flextable(table)
ftable
hospital | N known | N_Recover | Prop_Recover | ct_value_Recover | N_Death | Prop_Death | ct_value_Death |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
We see immediately that it has suboptimal spacing, and the proportions have too many decimal places.
Formatting cell content
We edit the proportion columns to one decimal place using flextable code. Note this could also have been done at the data management stage with the round() function.
ftable <- colformat_num(ftable, j = c(4,7), digits = 1)
ftable
hospital | N known | N_Recover | Prop_Recover | ct_value_Recover | N_Death | Prop_Death | ct_value_Death |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
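The round() alternative mentioned above could be sketched as follows, applied before the table is passed to flextable. This is a minimal illustration on an invented two-row data frame (toy_table is hypothetical, with proportions recomputed from the counts shown in the summary table). The use of across() with starts_with() is our assumption about a convenient way to target the proportion columns.

```r
library(dplyr)

# Hypothetical extract of the summary table, with unrounded proportions
toy_table <- data.frame(
  hospital     = c("Port Hospital", "Central Hospital"),
  Prop_Recover = c(42.44868, 46.08939),
  Prop_Death   = c(57.55132, 53.91061)
)

# Round every column whose name starts with "Prop_" to one decimal place
toy_rounded <- toy_table %>%
  mutate(across(starts_with("Prop_"), ~ round(.x, digits = 1)))

toy_rounded
```

Rounding at this stage changes the stored values, whereas colformat_num() only changes how they are displayed, so the flextable approach is preferable if the unrounded values are needed later.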
Formatting column width
We can use the autofit() function, which nicely stretches out the table so that each cell only has one row of text.
ftable %>% autofit()
hospital | N known | N_Recover | Prop_Recover | ct_value_Recover | N_Death | Prop_Death | ct_value_Death |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
However, this might not always be appropriate, especially if there are very long values within cells, meaning the table might not fit on the page.
Instead, we can specify widths. It can take some playing around to know what width value to use. In the example below, we specify different widths for columns 1 and 2, and for columns 4, 5, 7, and 8.
ftable <- ftable %>%
width(j=1, width = 2.7) %>%
width(j=2, width = 1.5) %>%
width(j=c(4,5,7,8), width = 1)
ftable
hospital | N known | N_Recover | Prop_Recover | ct_value_Recover | N_Death | Prop_Death | ct_value_Death |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
Column headers
We want clearer headers for easier interpretation of table contents.
First we can add an extra header layer for clarity. We do this with add_header_row() with top set to TRUE, so that columns about the same subgroups can be grouped together. We also rename the now-second header layer. Finally we merge the columns in the top header row.
ftable <- ftable %>%
add_header_row( values = c("Hospital", "Total cases with known outcome", "Recovered", "", "", "Died", "", ""), top = T) %>%
set_header_labels(hospital = "",
`N known` = "",
N_Recover = "Total",
Prop_Recover = "% of cases",
ct_value_Recover = "Median CT values",
N_Death = "Total",
Prop_Death = "% of cases",
ct_value_Death = "Median CT values") %>%
merge_at(i = 1, j = 3:5, part = "header") %>%
merge_at(i = 1, j = 6:8, part = "header")
ftable
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
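As an alternative to padding the header values with empty strings and merging afterwards, add_header_row() also accepts a colwidths argument that states how many columns each label spans. The sketch below assumes flextable is installed and uses a hypothetical five-column data frame (toy_df), so the labels and spans are illustrative only.

```r
library(flextable)

# Hypothetical, reduced version of the summary table
toy_df <- data.frame(
  hospital     = c("Port Hospital", "Central Hospital"),
  N_Recover    = c(579, 165),
  Prop_Recover = c(42.4, 46.1),
  N_Death      = c(785, 193),
  Prop_Death   = c(57.6, 53.9)
)

ft <- flextable(toy_df)

# Rename the existing (bottom) header layer
ft <- set_header_labels(ft,
                        hospital     = "Hospital",
                        N_Recover    = "Total",
                        Prop_Recover = "% of cases",
                        N_Death      = "Total",
                        Prop_Death   = "% of cases")

# One label per group; colwidths gives the number of columns each label spans
ft <- add_header_row(ft,
                     values    = c("", "Recovered", "Died"),
                     colwidths = c(1, 2, 2),
                     top       = TRUE)

# Number of header rows after adding the top layer
n_header_rows <- nrow_part(ft, part = "header")
print(n_header_rows)
```

The values in colwidths must add up to the number of columns in the table (here 1 + 2 + 2 = 5).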
Formatting borders and background
Flextable has default borders that do not respond well to additional header levels. We start from scratch by removing the existing borders with border_remove(). Then we add a black line to the bottom of the table using hline(), by specifying the 5th row of the table body; by default, hline() adds the line to the bottom of the specified row. To add black lines to the top of sections, we use hline_top().
We also use fp_border() here, which defines the border that is applied (its color and width). This is a function from the officer package.
library(officer)
ftable <- ftable %>%
border_remove() %>%
hline(part = "body", i=5, border = fp_border(color="black", width=2)) %>%
hline_top(part = "header", border = fp_border(color="black", width=2)) %>%
hline_top(part = "body", border = fp_border(color="black", width=2))
ftable
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
Font and alignment
We centre-align all columns aside from the left-most column with the hospital names, using the align function.
ftable <- ftable %>%
align(align = "center", j = c(2:8), part = "all")
ftable
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
Additionally, we can increase the header font size and make the header text bold.
ftable <- ftable %>%
fontsize(i = 1, size = 12, part = "header") %>%
bold(i = 1, bold = TRUE, part = "header")
ftable
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
Background
To distinguish the content of the table from the headers, we may want to add additional formatting, e.g. changing the background colour. In this example we change the background of the table body to light gray.
ftable <- ftable %>%
bg(., part = "body", bg = "gray95")
ftable
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
We can highlight all values in a column that meet a certain rule, e.g. where at least 55% of cases died.
ftable %>%
bg(j=7, i= ~ Prop_Death >=55, part = "body", bg = "red")
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
Or, we can highlight the entire row meeting a certain criterion, such as a hospital of interest. This is particularly helpful when looping through e.g. reports per geographical area, to highlight in tables where the current iteration compares to the other geographies. To do this we either remove the column (j) specification or, as below, specify all of the columns.
ftable %>%
bg(., j=c(1:8), i= ~ hospital == "Military Hospital", part = "body", bg = "#91c293")
Hospital | Total cases with known outcome | Recovered | | | Died | | |
| | Total | % of cases | Median CT values | Total | % of cases | Median CT values |
Port Hospital | 1,364 | 579 | 42.4 | 21.0 | 785 | 57.6 | 22.0 |
Military Hospital | 708 | 309 | 43.6 | 21.0 | 399 | 56.4 | 22.0 |
Other | 685 | 290 | 42.3 | 21.0 | 395 | 57.7 | 22.0 |
Central Hospital | 358 | 165 | 46.1 | 22.0 | 193 | 53.9 | 22.0 |
St. Mark's Maternity Hospital (SMMH) | 325 | 126 | 38.8 | 21.0 | 199 | 61.2 | 22.0 |
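The row selector i accepts a one-sided formula that is evaluated against the columns of the data behind the table (the original column names, not the relabelled headers). As a minimal sketch, assuming flextable is installed and using a made-up three-row data frame (toy_df is hypothetical), the rows a formula will pick can be checked in plain R before applying the formatting:

```r
library(flextable)

# Hypothetical data: one row per hospital
toy_df <- data.frame(
  hospital   = c("Port Hospital", "Central Hospital", "Other"),
  Prop_Death = c(57.6, 53.9, 57.7)
)

# Rows that the formula i = ~ Prop_Death >= 55 will select
rows_flagged <- which(toy_df$Prop_Death >= 55)
print(rows_flagged)

ft <- flextable(toy_df)

# Highlight single cells: rows meeting the rule, in one column
ft <- bg(ft, i = ~ Prop_Death >= 55, j = "Prop_Death",
         bg = "red", part = "body")

# Highlight a whole row: no j given, so all columns are coloured
ft <- bg(ft, i = ~ hospital == "Central Hospital",
         bg = "#91c293", part = "body")
```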
You can export the tables to Word, PowerPoint, or HTML files, or as image (PNG) files. To do this, use one of the following functions:
- save_as_docx()
- save_as_pptx()
- save_as_image()
- save_as_html()
For instance:
save_as_docx("my table" = ftable, path = "file.docx")
# Edit the 'my table' as needed for the title of table. If not specified the whole file will be blank.
save_as_image(ftable, path = "file.png")
## [1] "C:/Users/Neale/OneDrive - Neale Batra/Documents/Analytic Software/R/Projects/R handbook/Epi_R_handbook/file.png"
Note the packages webshot or webshot2 are required to save a flextable as an image. Images may come out with transparent backgrounds.
If you want to view a ‘live’ version of the flextable output in the intended document format, for instance to check that it fits on the page or to copy it into another document, you can use the print method with the argument preview set to “pptx” or “docx”. The document will pop up.
print(ftable, preview = "docx") # Word document example
print(ftable, preview = "pptx") # Powerpoint example
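When testing exports, it can be convenient to write to temporary files so that nothing in the project folder is overwritten. The sketch below assumes flextable (and its dependency officer) is installed; the table built from the built-in mtcars data and the title "Example table" are placeholders, not part of the handbook's analysis.

```r
library(flextable)

# Placeholder table from a built-in dataset
ft <- flextable(head(mtcars[, 1:4]))

# Temporary file paths (removed when the R session ends)
docx_path <- tempfile(fileext = ".docx")
html_path <- tempfile(fileext = ".html")

# The argument name becomes the table title in the output file
save_as_docx("Example table" = ft, path = docx_path)
save_as_html("Example table" = ft, path = html_path)

# Check that both files were written
files_written <- file.exists(c(docx_path, html_path))
print(files_written)
```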
The full flextable explanation is here: https://ardata-fr.github.io/flextable-book/
Age pyramids can be useful to show patterns by age group. They can show gender, or the distribution of other characteristics.
These tabs demonstrate how to produce age pyramids using:
- the apyramid package
- ggplot()
Age/gender demographic pyramids in R are generally made with ggplot() by creating two barplots (one for each gender), converting one’s values to negative values, and flipping the x and y axes to display the barplots vertically.
Here we offer a quick approach through the apyramid package:
ggplot() commands
For this tab we use the linelist dataset that is cleaned in the Cleaning tab.
To make a traditional age/sex demographic pyramid, the data must first be cleaned in the following ways:
Load packages
First, load the packages required for this analysis:
pacman::p_load(rio, # to import data
here, # to locate files
tidyverse, # to clean, handle, and plot the data (includes ggplot2 package)
apyramid, # a package dedicated to creating age pyramids
stringr) # working with strings for titles, captions, etc.
Load the data
linelist <- rio::import("linelist_cleaned.csv")
Check class of variables
Ensure that the age variable is class Numeric, and check the class and order of levels of age_cat and age_cat5
class(linelist$age_years)
## [1] "numeric"
class(linelist$age_cat)
## [1] "factor"
class(linelist$age_cat5)
## [1] "factor"
table(linelist$age_cat, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ <NA>
## 1090 1060 918 835 1067 727 96 7 88
table(linelist$age_cat5, useNA = "always")
##
## 0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80-84 85+ <NA>
## 1090 1060 918 835 643 424 292 187 147 101 53 18 17 8 5 2 0 0 88
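If the age category column in your own data is not yet an ordered factor, one possible way to build it is base R's cut(). This is a minimal sketch on a made-up vector of ages (ages and age_labels are hypothetical names), not the method used to create age_cat5 in the handbook's linelist.

```r
# Hypothetical ages in years (NA = missing)
ages <- c(0, 4, 5, 17, 33, 85, 90, NA)

# Labels "0-4", "5-9", ..., "80-84", plus an open-ended top group
lower      <- seq(0, 80, by = 5)
age_labels <- c(paste0(lower, "-", lower + 4), "85+")

# cut() returns a factor whose levels follow the order of the breaks
age_cat5 <- cut(ages,
                breaks = c(seq(0, 85, by = 5), Inf),
                right  = FALSE,       # intervals are [lower, upper)
                labels = age_labels)

class(age_cat5)
levels(age_cat5)
table(age_cat5, useNA = "always")
```

Because the levels are created in break order, "5-9" sorts before "10-14", which would not be the case for a plain character column.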
The package apyramid allows you to quickly make an age pyramid. For more nuanced situations, see the tab on using ggplot() to make age pyramids. You can read more about the apyramid package in its Help page by entering ?age_pyramid in your R console.
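Using the cleaned linelist dataset, we can create an age pyramid with just one simple command. If you need help cleaning your data, see the handbook page on Cleaning data (LINK). In this command:
- the data argument is set to the linelist dataframe
- the age_group argument is set to the name (in quotes) of the age category column (age_cat5)
- the split_by argument is set to the name (in quotes) of the column that splits the pyramid (gender)
apyramid::age_pyramid(data = linelist,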
age_group = "age_cat5",
split_by = "gender")
## Warning: 286 missing rows were removed (88 values from `age_cat5` and 286 values from `gender`).
When using the apyramid package, if the split_by column is binary (e.g. male/female, or yes/no), then the result will appear as a pyramid. However, if there are more than two values in the split_by column (not including NA), the pyramid will appear as a faceted barplot, with empty bars in the background indicating the range of the un-faceted data set for the age group. Values of split_by will appear as labels at the top of each facet. For example, see below, where the split_by variable is “hospital”.
apyramid::age_pyramid(data = linelist,
age_group = "age_cat5",
split_by = "hospital",
na.rm = FALSE) # show a bar for patients missing age, (note: this changes the pyramid into a faceted barplot)
Missing values
Rows missing values for the split_by or age_group columns, if coded as NA, will not trigger the faceting shown above. By default these rows will not be shown. However you can specify that they appear, in an adjacent barplot and as a separate age group at the top, by specifying na.rm = FALSE.
apyramid::age_pyramid(data = linelist,
age_group = "age_cat5",
split_by = "gender",
na.rm = FALSE) # show patients missing age or gender
Proportions, colors, & aesthetics
By default, the bars display counts (not %), a dashed mid-line for each group is shown, and the colors are green/purple. Each of these parameters can be adjusted, as shown below:
You can also add additional ggplot() commands to the plot using the standard ggplot() “+” syntax, such as aesthetic themes and label adjustments:
apyramid::age_pyramid(data = linelist,
age_group = "age_cat5",
split_by = "gender",
proportional = TRUE, # show percents, not counts
show_midpoint = FALSE, # remove bar mid-point line
#pal = c("orange", "purple") # can specify alt. colors here (but not labels, see below)
)+
# additional ggplot commands
theme_minimal()+ # simplify the background
scale_fill_manual(values = c("orange", "purple"), # to specify colors AND labels
labels = c("Male", "Female"))+
labs(y = "Percent of all cases", # note that x and y labels are switched (see ggplot tab)
x = "Age categories",
fill = "Gender",
caption = "My data source and caption here",
title = "Title of my plot",
subtitle = "Subtitle with \n a second line...")+
theme(
legend.position = "bottom", # move legend to bottom
axis.text = element_text(size = 10, face = "bold"), # fonts/sizes, see ggplot tips page
axis.title = element_text(size = 12, face = "bold"))
## Warning: 286 missing rows were removed (88 values from `age_cat5` and 286 values from `gender`).
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
The examples above assume your data are in a linelist-like format, with one row per observation. If your data are already aggregated into counts by age category, you can still use the apyramid package, as shown below.
Let’s say that your dataset looks like this, with columns for age category, and male counts, female counts, and missing counts.
(see the handbook page on Transforming data for tips)
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.
# View the aggregated data
DT::datatable(demo_agg, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
ggplot() prefers data in “long” format, so first pivot the data to be “long” with the pivot_longer() function from tidyr.
# pivot the aggregated data into long format
demo_agg_long <- demo_agg %>%
pivot_longer(c(f, m, missing_gender), # cols to elongate
names_to = "gender", # name for new col of categories
values_to = "counts") %>% # name for new col of counts
mutate(gender = na_if(gender, "missing_gender")) # convert "missing_gender" to NA
# View the aggregated data
DT::datatable(demo_agg_long, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
Then use the split_by and count arguments of age_pyramid() to specify the respective columns:
apyramid::age_pyramid(data = demo_agg_long,
age_group = "age_cat5",
split_by = "gender",
count = "counts") # give the column name for the aggregated counts
## Warning: Removed 24 rows containing missing values (position_stack).
## Warning: Removed 19 rows containing missing values.
Note in the above that the factor order of “m” and “f” is different (the pyramid is reversed). To adjust the order, you must re-define gender in the aggregated data as a Factor and order the levels as desired.
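Re-defining the level order can be sketched as follows. The data frame toy_agg_long is made up for illustration and only mimics the shape of demo_agg_long; the pyramid itself is drawn only if the apyramid package happens to be installed. Which side each level is drawn on is best confirmed by looking at the resulting plot.

```r
# Hypothetical aggregated counts in long format
# (one row per age group and gender)
toy_agg_long <- data.frame(
  age_cat5 = factor(rep(c("0-4", "5-9"), each = 2),
                    levels = c("0-4", "5-9")),
  gender   = rep(c("m", "f"), times = 2),
  counts   = c(40, 35, 30, 32)
)

# Re-define gender as a factor with the levels in the desired order
toy_agg_long$gender <- factor(toy_agg_long$gender, levels = c("m", "f"))
levels(toy_agg_long$gender)

# Draw the pyramid only if apyramid is available
if (requireNamespace("apyramid", quietly = TRUE)) {
  p <- apyramid::age_pyramid(data      = toy_agg_long,
                             age_group = "age_cat5",
                             split_by  = "gender",
                             count     = "counts")
}
```

Swapping the order to levels = c("f", "m") and re-running the plot shows the effect of the level order.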
ggplot()
Using ggplot() to build your age pyramid allows for more flexibility, but requires more effort and understanding of how ggplot() works. It is also easier to accidentally make mistakes.
apyramid uses ggplot() in the background (and accepts ggplot() commands added), but this page shows how to adjust or recreate a pyramid only using ggplot(), if you wish.
First, understand that to make such a pyramid using ggplot() the approach is to:
Within the ggplot(), create two graphs by age category. Create one for each of the two grouping values (in this case gender). See the filters applied to the data argument in each geom_histogram() command below.
If using geom_histogram(), the graphs operate off the numeric column (e.g. age_years), whereas if using geom_bar() the graphs operate from an ordered Factor (e.g. age_cat5).
One graph will have positive count values, while the other will have its counts converted to negative values - this allows both graphs to be seen and compared against each other in the same plot.
The command coord_flip() switches the X and Y axes, resulting in the graphs turning vertical and creating the pyramid.
Lastly, the counts-axis labels must be specified so they appear as “positive” counts on both sides of the pyramid (despite the underlying values on one side being negative).
A simple version of this, using geom_histogram(), is below:
# begin ggplot
ggplot(data = linelist, aes(x = age, fill = gender)) +
# female histogram
geom_histogram(data = filter(linelist, gender == "f"),
breaks = seq(0,85,5),
colour = "white") +
# male histogram (values converted to negative)
geom_histogram(data = filter(linelist, gender == "m"),
breaks = seq(0,85,5),
aes(y=..count..*(-1)),
colour = "white") +
# flip the X and Y axes
coord_flip() +
# adjust counts-axis scale
scale_y_continuous(limits = c(-600, 900),
breaks = seq(-600,900,100),
labels = abs(seq(-600, 900, 100)))
DANGER: If the limits of your counts axis are set too low, and a counts bar exceeds them, the bar will disappear entirely or be artificially shortened! Watch for this if analyzing data which is routinely updated. Prevent it by having your count-axis limits auto-adjust to your data, as below.
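One way to make the limits follow the data is to compute the largest bin count first and derive the axis limits, breaks, and labels from it. The sketch below assumes ggplot2 (version 3.3.0 or later, for after_stat()) and uses randomly generated data (toy is hypothetical); the rounding to the next multiple of 10 is an arbitrary choice for this small example.

```r
library(ggplot2)

# Hypothetical linelist-like data
set.seed(1)
toy <- data.frame(
  age    = sample(0:84, 500, replace = TRUE),
  gender = sample(c("f", "m"), 500, replace = TRUE)
)

age_breaks <- seq(0, 85, 5)

# Largest count in any 5-year age bin, for either gender
bin_counts <- table(cut(toy$age, age_breaks, include.lowest = TRUE),
                    toy$gender)
max_count  <- max(bin_counts)

# Round up to the next multiple of 10 so the axis ends on a clean break
axis_max    <- ceiling(max_count / 10) * 10
axis_breaks <- seq(-axis_max, axis_max, by = 10)

p <- ggplot(toy, aes(x = age, fill = gender)) +
  # female histogram (positive counts)
  geom_histogram(data = subset(toy, gender == "f"),
                 breaks = age_breaks, colour = "white") +
  # male histogram (counts converted to negative)
  geom_histogram(data = subset(toy, gender == "m"),
                 breaks = age_breaks, colour = "white",
                 aes(y = after_stat(count) * (-1))) +
  coord_flip() +
  # limits, breaks and labels all derive from the data
  scale_y_continuous(limits = c(-axis_max, axis_max),
                     breaks = axis_breaks,
                     labels = abs(axis_breaks))
```

Because axis_max is recalculated each time the code runs, the bars stay inside the plot when the data are updated.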
There are many things you can change/add to this simple version, including:
# create dataset with proportion of total
pyramid_data <- linelist %>%
group_by(age_cat5, gender) %>%
summarize(counts = n()) %>%
ungroup() %>%
mutate(percent = round(100*(counts / sum(counts, na.rm=T)),1),
percent = case_when(
gender == "f" ~ percent,
gender == "m" ~ -percent,
TRUE ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)
# begin ggplot
ggplot()+ # default x-axis is age in years;
# case data graph
geom_bar(data = pyramid_data,
stat = "identity",
aes(x = age_cat5,
y = percent,
fill = gender), #
colour = "white")+ # white around each bar
# flip the X and Y axes to make pyramid vertical
coord_flip()+
# adjust the axes scales (remember they are flipped now!)
#scale_x_continuous(breaks = seq(0,100,5), labels = seq(0,100,5)) +
scale_y_continuous(limits = c(min_per, max_per),
breaks = seq(floor(min_per), ceiling(max_per), 2),
labels = paste0(abs(seq(floor(min_per), ceiling(max_per), 2)), "%"))+
# designate colors and legend labels manually
scale_fill_manual(
values = c("f" = "orange",
"m" = "darkgreen"),
labels = c("Female", "Male"),
) +
# label values (remember X and Y flipped now)
labs(
x = "Age group",
y = "Percent of total",
fill = NULL,
caption = stringr::str_glue("Data are from linelist \nn = {nrow(linelist)} (age or sex missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases) \nData as of: {format(Sys.Date(), '%d %b %Y')}")) +
# optional aesthetic themes
theme(
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust=0, size=11, face = "italic")) +
ggtitle(paste0("Age and gender of cases"))
## Warning: Removed 12 rows containing missing values (position_stack).
With the flexibility of ggplot(), you can have a second layer of bars in the background that represent the true population pyramid. This can provide a nice visualization to compare the observed counts with the baseline.
Import and view the population data
# import the population demographics data
pop <- rio::import("country_demographics.csv")
# display the population data as a table
DT::datatable(pop, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )
First some data management steps:
Here we record the order of age categories that we want to appear. Due to some quirks in the way ggplot() is implemented, it is easiest to store these as a character vector and use them later in the plotting function.
# record correct age cat levels
age_levels <- c("0-4","5-9", "10-14", "15-19", "20-24",
"25-29","30-34", "35-39", "40-44", "45-49",
"50-54", "55-59", "60-64", "65-69", "70-74",
"75-79", "80-84", "85+")
Combine the population and case data through the dplyr function bind_rows():
# create/transform population data, with percent of total
########################################################
pop_data <- pivot_longer(pop, c(m, f), names_to = "gender", values_to = "counts") %>% # pivot gender columns longer
mutate(data = "population", # add column designating data source
percent = round(100*(counts / sum(counts, na.rm=T)),1), # calculate % of total
percent = case_when( # if male, convert % to negative
gender == "f" ~ percent,
gender == "m" ~ -percent,
TRUE ~ NA_real_))
Review the changed population dataset
# display the population data as a table
DT::datatable(pop_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
Now implement the same for the case linelist. This is slightly different because it begins with case-rows, not counts.
# create case data by age/gender, with percent of total
#######################################################
case_data <- linelist %>%
group_by(age_cat5, gender) %>% # aggregate linelist cases into age-gender groups
summarize(counts = n()) %>% # calculate counts per age-gender group
ungroup() %>%
mutate(data = "cases", # add column designating data source
percent = round(100*(counts / sum(counts, na.rm=T)),1), # calculate % of total for age-gender groups
percent = case_when( # convert % to negative if male
gender == "f" ~ percent,
gender == "m" ~ -percent,
TRUE ~ NA_real_))
## `summarise()` has grouped output by 'age_cat5'. You can override using the `.groups` argument.
Review the changed case dataset
# display the case data as a table
DT::datatable(case_data, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
Now the two datasets are combined, one on top of the other (same column names)
# combine case and population data (same column names, age_cat values, and gender values)
pyramid_data <- bind_rows(case_data, pop_data)
Store the maximum and minimum percent values, used in the plotting function to define the extent of the plot (and not cut off any bars!)
# Define extent of percent axis, used for plot limits
max_per <- max(pyramid_data$percent, na.rm=T)
min_per <- min(pyramid_data$percent, na.rm=T)
Now the plot is made with ggplot():
# begin ggplot
##############
ggplot()+ # default x-axis is age in years;
# population data graph
geom_bar(data = filter(pyramid_data, data == "population"),
stat = "identity",
aes(x = age_cat5,
y = percent,
fill = gender),
colour = "black", # black color around bars
alpha = 0.2, # more transparent
width = 1)+ # full width
# case data graph
geom_bar(data = filter(pyramid_data, data == "cases"),
stat = "identity", # use % as given in data, not counting rows
aes(x = age_cat5, # age categories as original X axis
y = percent, # % as original Y-axis
fill = gender), # fill of bars by gender
colour = "black", # black color around bars
alpha = 1, # not transparent
width = 0.3)+ # narrower than the population bars
# flip the X and Y axes to make pyramid vertical
coord_flip()+
# adjust axes order, scale, and labels (remember X and Y axes are flipped now)
# manually ensure that age-axis is ordered correctly
scale_x_discrete(limits = age_levels)+
# set percent-axis
scale_y_continuous(limits = c(min_per, max_per), # min and max defined above
breaks = seq(floor(min_per), ceiling(max_per), by = 2), # from min% to max% by 2
labels = paste0( # for the labels, paste together...
abs(seq(floor(min_per), ceiling(max_per), by = 2)), # ...rounded absolute values of breaks...
"%"))+ # ... with "%"
# floor(), ceiling() round down and up
# designate colors and legend labels manually
scale_fill_manual(
values = c("f" = "orange", # assign colors to values in the data
"m" = "darkgreen"),
labels = c("f" = "Female",
"m"= "Male"), # change labels that appear in legend, note order
) +
# plot labels, titles, caption
labs(
title = "Case age and gender distribution,\nas compared to baseline population",
subtitle = "",
x = "Age category",
y = "Percent of total",
fill = NULL,
caption = stringr::str_glue("Cases shown on top of country demographic baseline\nCase data are from linelist, n = {nrow(linelist)}\nAge or gender missing for {sum(is.na(linelist$gender) | is.na(linelist$age_years))} cases\nCase data as of: {format(max(linelist$date_onset, na.rm=T), '%d %b %Y')}")) +
# optional aesthetic themes
theme(
legend.position = "bottom", # move legend to bottom
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
plot.title = element_text(hjust = 0),
plot.caption = element_text(hjust=0, size=11, face = "italic"))
## Warning: Removed 12 rows containing missing values (position_stack).
The techniques used to make a population pyramid with ggplot() can also be used to make plots of Likert-scale survey data.
Import the data
# import the likert survey response data
likert_data <- rio::import("likert_data.csv")
Start with data that looks like this, with a categorical classification of each respondent (status) and their answers to 8 questions on a 4-point Likert-type scale (“Very Poor”, “Poor”, “Good”, “Very Good”).
# display the Likert survey data as a table
DT::datatable(likert_data, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )
First, some data management steps:
- Pivot the data longer
- Create a new column direction depending on whether the response was generally “positive” or “negative”
- Set the factor level order for the status column and the Response column
- Store the maximum count value so that the limits of the plot are appropriate
melted <- pivot_longer(likert_data, Q1:Q8, names_to = "Question", values_to = "Response") %>%
mutate(direction = case_when(
Response %in% c("Poor","Very Poor") ~ "Negative",
Response %in% c("Good", "Very Good") ~ "Positive",
TRUE ~ "Unknown"),
status = factor(status, levels = rev(c(
"Senior", "Intermediate", "Junior"))),
Response = factor(Response, levels = c("Very Good", "Good",
"Very Poor", "Poor"))) # must reverse Very Poor and Poor for ordering to work
melted_max <- melted %>%
group_by(status, Question) %>%
summarize(n = n())
## `summarise()` has grouped output by 'status'. You can override using the `.groups` argument.
melted_max <- max(melted_max$n, na.rm=T)
Now make the plot:
# make plot
ggplot()+
# bar graph of the "negative" responses
geom_bar(data = filter(melted,
direction == "Negative"),
aes(x = status,
y=..count..*(-1), # counts inverted to negative
fill = Response),
color = "black",
closed = "left",
position = "stack")+
# bar graph of the "positive responses
geom_bar(data = filter(melted, direction == "Positive"),
aes(x = status, fill = Response),
colour = "black",
closed = "left",
position = "stack")+
# flip the X and Y axes
coord_flip()+
# Black vertical line at 0
geom_hline(yintercept = 0, color = "black", size=1)+
# convert labels to all positive numbers
scale_y_continuous(limits = c(-ceiling(melted_max/10)*11, ceiling(melted_max/10)*10), # seq from neg to pos by 10, edges rounded outward to nearest 5
breaks = seq(-ceiling(melted_max/10)*10, ceiling(melted_max/10)*10, 10),
labels = abs(unique(c(seq(-ceiling(melted_max/10)*10, 0, 10),
seq(0, ceiling(melted_max/10)*10, 10))))) +
# color scales manually assigned
scale_fill_manual(values = c("Very Good" = "green4", # assigns colors
"Good" = "green3",
"Poor" = "yellow",
"Very Poor" = "red3"),
breaks = c("Very Good", "Good", "Poor", "Very Poor"))+ # orders the legend
# facet the entire plot so each question is a sub-plot
facet_wrap(~Question, ncol = 3)+
# labels, titles, caption
labs(x = "Respondent status",
y = "Number of responses",
fill = "")+
ggtitle(str_glue("Likert-style responses\nn = {nrow(likert_data)}"))+
# aesthetic settings
theme_minimal()+
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 14, face = "bold"),
strip.text = element_text(size = 14, face = "bold"), # facet sub-titles
plot.title = element_text(size = 20, face = "bold"),
panel.background = element_rect(fill = NA, color = "black")) # black box around each facet
## Warning: Ignoring unknown parameters: closed
## Warning: Ignoring unknown parameters: closed
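The data management behind the Likert plot can be checked on a very small example before running it on the full survey. The sketch below assumes dplyr and tidyr are installed; toy_likert is a made-up two-respondent, two-question data frame, and the signed_n column is only an illustration of how negative responses end up below zero in the plot.

```r
library(dplyr)
library(tidyr)

# Hypothetical survey data: 2 respondents, 2 questions
toy_likert <- data.frame(
  status = c("Junior", "Senior"),
  Q1     = c("Good", "Poor"),
  Q2     = c("Very Good", "Very Poor")
)

# Pivot longer and classify each response as positive or negative
toy_long <- toy_likert %>%
  pivot_longer(Q1:Q2, names_to = "Question", values_to = "Response") %>%
  mutate(direction = case_when(
    Response %in% c("Poor", "Very Poor") ~ "Negative",
    Response %in% c("Good", "Very Good") ~ "Positive",
    TRUE                                 ~ "Unknown"))

# Counts per question and direction, with negative responses below zero
signed_counts <- toy_long %>%
  count(Question, direction) %>%
  mutate(signed_n = ifelse(direction == "Negative", -n, n))

print(signed_counts)
```

Any response that is spelled differently from the values listed in case_when() is classified as "Unknown", so a quick look at the direction column can reveal capitalization mismatches.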
This page covers:
Load packages
pacman::p_load(
DiagrammeR, # for flow diagrams
networkD3 # For alluvial/Sankey diagrams
)
One can use the R package DiagrammeR to create charts/flow charts. They can be static, or they can adjust somewhat dynamically based on changes in a dataset.
Tools
The function grViz() is used to create a “Graphviz” diagram. This function accepts a character string input containing instructions for making the diagram. Within that string, the instructions are written in a different language, called DOT - it is quite easy to learn the basics.
Basic structure
grViz("digraph my_flow_chart {}")
Below are two simple examples
A very minimal example:
# A minimal plot
DiagrammeR::grViz("digraph {
graph[layout = dot, rankdir = LR]
a
b
c
a -> b -> c
}")
An example with applied public health context:
grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes
#######
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
Primary # names of nodes
Secondary
Tertiary
# edges
#######
Primary -> Secondary [label = 'case transfer']
Secondary -> Tertiary [label = 'case transfer']
}
")
Basic syntax
Node names, or edge statements, can be separated with spaces, semicolons, or newlines.
Rank direction
A plot can be re-oriented to move left-to-right by adjusting the rankdir argument within the graph statement. The default is TB (top-to-bottom), but it can be LR (left-to-right), RL, or BT.
Node names
Node names can be single words, as in the simple example above. To use multi-word names or special characters (e.g. parentheses, dashes), put the node name within single quotes (’ ’). It may be easier to have a short node name, and assign a label, as shown below within brackets [ ]. A label is also necessary to have a newline within the node name - use \n in the node label within single quotes, as shown below.
Subgroups
Within edge statements, subgroups can be created on either side of the edge with curly brackets ({ }). The edge then applies to all nodes in the bracket - it is a shorthand.
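As a small illustration of the subgroup shorthand and of a quoted multi-word node name, the sketch below assumes the DiagrammeR package is installed. The graph name reporting and the node names are made up for this example.

```r
library(DiagrammeR)

# One edge statement with a subgroup on the left-hand side:
# it draws an edge from each of the three nodes to District.
# 'Health post' is quoted because the name contains a space.
g <- grViz("
digraph reporting {
  graph [layout = dot, rankdir = LR]

  {'Health post' Clinic Hospital} -> District
}
")

# Printing the object displays the diagram in the viewer
g
```

Written out in full, the same graph would need three separate edge statements, one per facility.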
Layouts
- rankdir: either TB, LR, RL, or BT
Nodes - editable attributes
- label (text, in single quotes if multi-word)
- fillcolor (many possible colors)
- fontcolor
- alpha (transparency 0-1)
- shape (ellipse, oval, diamond, egg, plaintext, point, square, triangle)
- style
- sides
- peripheries
- fixedsize (h x w)
- height
- width
- distortion
- penwidth (width of shape border)
- x (displacement left/right)
- y (displacement up/down)
- fontname
- fontsize
- icon
Edges - editable attributes
- arrowsize
- arrowhead (normal, box, crow, curve, diamond, dot, inv, none, tee, vee)
- arrowtail
- dir (direction)
- style (dashed, …)
- color
- alpha
- headport (text in front of arrowhead)
- tailport (text behind arrowtail)
- fontname
- fontsize
- fontcolor
- penwidth (width of arrow)
- minlen (minimum length)
Color names: hexadecimal values or ‘X11’ color names, see here for X11 details
The example below expands on the surveillance_diagram, adding complex node names, grouped edges, colors and styling
DiagrammeR::grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB, # layout top-to-bottom
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3]
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
# edges
#######
Primary -> Secondary [label = 'case transfer',
fontcolor = red,
color = red]
Secondary -> Tertiary [label = 'case transfer',
fontcolor = red,
color = red]
# grouped edge
{Primary Secondary Tertiary} -> SC [label = 'case reporting',
fontcolor = darkgreen,
color = darkgreen,
style = dashed]
}
")
Sub-graph clusters
To group nodes into boxed clusters, put them within the same named subgraph (subgraph name {}). To have each subgraph identified within a bounding box, begin the name of the subgraph with “cluster”, as shown with the 4 boxes below.
DiagrammeR::grViz(" # All instructions are within a large character string
digraph surveillance_diagram { # 'digraph' means 'directional graph', then the graph name
# graph statement
#################
graph [layout = dot,
rankdir = TB,
overlap = true,
fontsize = 10]
# nodes (circles)
#################
node [shape = circle, # shape = circle
fixedsize = true
width = 1.3] # width of circles
subgraph cluster_passive {
Primary [label = 'Primary\nFacility']
Secondary [label = 'Secondary\nFacility']
Tertiary [label = 'Tertiary\nFacility']
SC [label = 'Surveillance\nCoordination',
fontcolor = darkgreen]
}
# nodes (boxes)
###############
node [shape = box, # node shape
fontname = Helvetica] # text font in node
subgraph cluster_active {
Active [label = 'Active\nSurveillance'];
HCF_active [label = 'HCF\nActive Search']
}
subgraph cluster_EBD {
EBS [label = 'Event-Based\nSurveillance (EBS)'];
'Social Media'
Radio
}
subgraph cluster_CBS {
CBS [label = 'Community-Based\nSurveillance (CBS)'];
RECOs
}
# edges
#######
{Primary Secondary Tertiary} -> SC [label = 'case reporting']
Primary -> Secondary [label = 'case transfer',
fontcolor = red]
Secondary -> Tertiary [label = 'case transfer',
fontcolor = red]
HCF_active -> Active
{'Social Media'; Radio} -> EBS
RECOs -> CBS
}
")
Node shapes
The example below, borrowed from this tutorial, shows applied node shapes and a shorthand for serial edge connections:
DiagrammeR::grViz("digraph {
graph [layout = dot, rankdir = LR]
# define the global styles of the nodes. We can override these in box if we wish
node [shape = rectangle, style = filled, fillcolor = Linen]
data1 [label = 'Dataset 1', shape = folder, fillcolor = Beige]
data2 [label = 'Dataset 2', shape = folder, fillcolor = Beige]
process [label = 'Process \n Data']
statistical [label = 'Statistical \n Analysis']
results [label= 'Results']
# edge definitions with the node IDs
{data1 data2} -> process -> statistical -> results
}")
How to handle and save outputs
“Parameterized figures: A great benefit of designing figures within R is that we are able to connect the figures directly with our analysis by reading R values directly into our flowcharts. For example, suppose you have created a filtering process which removes values after each stage of a process, you can have a figure show the number of values left in the dataset after each stage of your process. To do this we, you can use the @@X symbol directly within the figure, then refer to this in the footer of the plot using [X]:, where X is the a unique numeric index. Here is a basic example:”
https://mikeyharper.uk/flowcharts-in-r-using-diagrammer/
# Define some sample data
data <- list(a=1000, b=800, c=600, d=400)
DiagrammeR::grViz("
digraph graph2 {
graph [layout = dot]
# node definitions with substituted label text
node [shape = rectangle, width = 4, style = filled, fillcolor = Beige]
a [label = '@@1']
b [label = '@@2']
c [label = '@@3']
d [label = '@@4']
a -> b -> c -> d
}
[1]: paste0('Raw Data (n = ', data$a, ')')
[2]: paste0('Remove Errors (n = ', data$b, ')')
[3]: paste0('Identify Potential Customers (n = ', data$c, ')')
[4]: paste0('Select Top Priorities (n = ', data$d, ')')
")
Much of the above is adapted from the tutorial at this site.
Another, more in-depth tutorial: http://rich-iannone.github.io/DiagrammeR/
Note that the tutorial above may be out of date relative to the current DiagrammeR package.
Load packages
pacman::p_load(networkD3)
Plotting the connections in a dataset
https://www.r-graph-gallery.com/321-introduction-to-interactive-sankey-diagram-2.html
Counts of age category and hospital, relabeled as target and source, respectively.
# counts by hospital and age category
links <- linelist %>%
select(hospital, age_cat) %>%
count(hospital, age_cat) %>%
rename(source = hospital,
target = age_cat)
Now formalize the nodes list, and adjust the ID columns to be numbers instead of labels:
# The unique node names
nodes <- data.frame(
name=c(as.character(links$source), as.character(links$target)) %>%
unique()
)
# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
Now plot the Sankey diagram:
# plot
######
p <- sankeyNetwork(Links = links,
Nodes = nodes,
Source = "IDsource",
Target = "IDtarget",
Value = "n",
NodeID = "name",
units = "cases",
fontSize = 12,
nodeWidth = 30)
p
Here is an example where the patient outcome is included as well. Note in the data management step how we bind rows of counts of hospital -> outcome, using the same column names.
# counts by hospital and age category
links <- linelist %>%
select(hospital, age_cat) %>%
mutate(age_cat = stringr::str_glue("Age {age_cat}")) %>%
count(hospital, age_cat) %>%
rename(source = age_cat,
target = hospital) %>%
bind_rows(
linelist %>%
select(hospital, outcome) %>%
count(hospital, outcome) %>%
rename(source = hospital,
target = outcome)
)
# The unique node names
nodes <- data.frame(
name=c(as.character(links$source), as.character(links$target)) %>%
unique()
)
# match to numbers, not names
links$IDsource <- match(links$source, nodes$name)-1
links$IDtarget <- match(links$target, nodes$name)-1
# plot
######
p <- sankeyNetwork(Links = links,
Nodes = nodes,
Source = "IDsource",
Target = "IDtarget",
Value = "n",
NodeID = "name",
units = "cases",
fontSize = 12,
nodeWidth = 30)
p
https://www.displayr.com/sankey-diagrams-r/
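sankeyNetwork() expects the Source and Target columns to hold zero-based row positions in the nodes data frame, which is why 1 is subtracted from the result of match() in the code above. A minimal base R sketch of that step, using a made-up three-row links table (the hospital names, age groups and counts are hypothetical):

```r
# Hypothetical link counts (not taken from the handbook linelist)
links <- data.frame(
  source = c("Hospital A", "Hospital A", "Hospital B"),
  target = c("0-4", "5-9", "0-4"),
  n      = c(12, 7, 5),
  stringsAsFactors = FALSE
)

# One row per unique node name: sources first, then targets
nodes <- data.frame(
  name = unique(c(links$source, links$target)),
  stringsAsFactors = FALSE
)

# match() returns 1-based positions; subtract 1 to get zero-based IDs
links$IDsource <- match(links$source, nodes$name) - 1
links$IDtarget <- match(links$target, nodes$name) - 1

print(nodes)
print(links)
```

Printing both tables side by side is a quick way to confirm that every ID points at the intended node name before plotting.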
Timeline Sankey - LTFU from cohort… application/rejections… etc.
To make a timeline showing specific events, you can use the vistime package.
See this vignette
# load package
pacman::p_load(vistime, # make the timeline
plotly # for interactive visualization
)
Here is the events dataset we begin with:
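A hypothetical stand-in for such an events table is sketched below. It assumes vistime's default column names (event, start, end and group); the events and dates are invented purely for illustration and are not the handbook's data.

```r
# Hypothetical outbreak-response events (invented for illustration)
data <- data.frame(
  event = c("First case reported", "Outbreak declared", "Vaccination campaign"),
  group = c("Surveillance", "Response", "Response"),
  start = as.Date(c("2020-01-05", "2020-01-20", "2020-02-01")),
  end   = as.Date(c("2020-01-05", "2020-01-20", "2020-03-15")),
  stringsAsFactors = FALSE
)

# Flag events whose start and end coincide (single-day events)
single_day <- data$start == data$end

print(data)
print(single_day)
```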
p <- vistime(data) # apply vistime
library(plotly)
# step 1: transform into a list
pp <- plotly_build(p)
# step 2: Marker size
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "markers") pp$x$data[[i]]$marker$size <- 10
}
# step 3: text size
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textfont$size <- 10
}
# step 4: text position
for(i in 1:length(pp$x$data)){
if(pp$x$data[[i]]$mode == "text") pp$x$data[[i]]$textposition <- "right"
}
#print
pp
You can build a DAG manually using the DiagrammeR package and DOT language, as described in another tab. Alternatively, there are packages like ggdag and dagitty.
https://cran.r-project.org/web/packages/ggdag/vignettes/intro-to-dags.html
https://www.r-bloggers.com/2019/08/causal-inference-with-dags-in-r/#:~:text=In%20a%20DAG%20all%20the,for%20drawing%20and%20analyzing%20DAGs.
Links to other online tutorials or resources.
This analysis plots the frequency of different combinations of values/responses. In this example, we plot the frequency of symptom combinations.
This analysis is often called:
Multiple response analysis
Sets analysis
Combinations analysis
The first method shown uses the package ggupset, and the second uses the package UpSetR.
An example plot is below. Five symptoms are shown. Below each vertical bar is a line and dots indicating the combination of symptoms reflected by the bar above. To the right, horizontal bars reflect the frequency of each individual symptom.
pacman::p_load(tidyverse,
UpSetR,
ggupset)
This linelist includes five “yes/no” variables on reported symptoms. We will need to transform these variables a bit to use the ggupset package to make our plot.
View the data (scroll to the right to see the symptoms variables)
We convert the “yes” and “no” values to the actual symptom name. If “no”, we set the value as missing (NA).
# create column with the symptoms named, separated by semicolons
linelist_sym_1 <- linelist_sym %>%
# convert the "yes" and "no" values into the symptom name itself
mutate(fever = case_when(fever == "yes" ~ "fever", # if old value is "yes", new value is "fever"
TRUE ~ NA_character_), # if old value is anything other than "yes", the new value is NA
chills = case_when(chills == "yes" ~ "chills",
TRUE ~ NA_character_),
cough = case_when(cough == "yes" ~ "cough",
TRUE ~ NA_character_),
aches = case_when(aches == "yes" ~ "aches",
TRUE ~ NA_character_),
shortness_of_breath = case_when(shortness_of_breath == "yes" ~ "shortness_of_breath",
TRUE ~ NA_character_))
Now we make two final variables:
1. Pasting together all the symptoms of the patient (character variable)
2. Convert the above to class list, so it can be accepted by ggupset to make the plot
linelist_sym_1 <- linelist_sym_1 %>%
mutate(
# combine the variables into one, using paste() with a semicolon separating any values
all_symptoms = paste(fever, chills, cough, aches, shortness_of_breath, sep = "; "),
# make a copy of all_symptoms variable, but of class "list" (which is required to use ggupset() in next step)
all_symptoms_list = as.list(strsplit(all_symptoms, "; "))
)
View the new data. Note the two columns at the end - the pasted combined values, and the list
DT::datatable(linelist_sym_1, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T))
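One detail of the paste() step is worth checking before plotting: paste() turns NA into the literal string "NA", so a patient without a given symptom contributes "NA" to the combined value. The base R sketch below uses two made-up symptom vectors (hypothetical values) to show the effect, and one possible way to drop those placeholders.

```r
# Hypothetical symptom columns after recoding ("yes" -> symptom name, otherwise NA)
fever  <- c("fever", NA, "fever")
chills <- c(NA, NA, "chills")

# paste() converts NA to the string "NA"
all_symptoms <- paste(fever, chills, sep = "; ")
print(all_symptoms)

# strsplit() returns a list with one character vector per patient
all_symptoms_list <- strsplit(all_symptoms, "; ")

# Optionally drop the "NA" placeholders so only real symptoms remain
symptoms_clean <- lapply(all_symptoms_list, function(s) s[s != "NA"])
print(symptoms_clean)
```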
ggupset
Load required package to make the plot (ggupset)
pacman::p_load(ggupset)
Create the plot:
ggplot(linelist_sym_1,
aes(x=all_symptoms_list)) +
geom_bar() +
scale_x_upset(reverse = FALSE,
n_intersections = 10,
sets = c("fever", "chills", "cough", "aches", "shortness_of_breath")
)+
labs(title = "Signs & symptoms",
subtitle = "10 most frequent combinations of signs and symptoms",
caption = "Caption here.",
x = "Symptom combination",
y = "Frequency in dataset")
## Warning: Removed 748 rows containing non-finite values (stat_count).
More information on ggupset can be found online or offline in the package documentation in your RStudio Help tab.
UpSetR
The UpSetR package allows more customization, but is more difficult to execute:
https://github.com/hms-dbmi/UpSetR (read this)
https://gehlenborglab.shinyapps.io/upsetr/ (Shiny App version - you can upload your own data)
https://cran.r-project.org/web/packages/UpSetR/UpSetR.pdf (documentation - difficult to interpret)
pacman::p_load(UpSetR)
Convert symptoms variables to 1/0.
# Make using upSetR
linelist_sym_2 <- linelist_sym %>%
# convert the "yes" and "no" values into 1 and 0
mutate(fever = case_when(fever == "yes" ~ 1, # if old value is "yes", new value is 1
TRUE ~ 0), # if old value is anything other than "yes", the new value is 0
chills = case_when(chills == "yes" ~ 1,
TRUE ~ 0),
cough = case_when(cough == "yes" ~ 1,
TRUE ~ 0),
aches = case_when(aches == "yes" ~ 1,
TRUE ~ 0),
shortness_of_breath = case_when(shortness_of_breath == "yes" ~ 1,
TRUE ~ 0))
Now make the plot, using only the symptom variables. You must designate which “sets” to compare (the names of the symptom variables).
Alternatively, use nsets = to show only the X largest sets, or nintersects = with order.by = "freq" to show only the X most frequent combinations.
# Make the plot
UpSetR::upset(
select(linelist_sym_2, fever, chills, cough, aches, shortness_of_breath),
sets = c("fever", "chills", "cough", "aches", "shortness_of_breath"),
order.by = "freq",
sets.bar.color = c("blue", "red", "yellow", "darkgreen", "orange"), # optional colors
empty.intersections = "on",
# nsets = 3,
number.angles = 0,
point.size = 3.5,
line.size = 2,
mainbar.y.label = "Symptoms Combinations",
sets.x.label = "Patients with Symptom")
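As a cross-check on what an UpSet plot displays, the combination counts can also be tabulated directly. The sketch below uses base R and a made-up four-patient table (hypothetical values); %in% is used so that missing values are coded 0, mirroring the TRUE ~ 0 branch in the recoding above.

```r
# Hypothetical yes/no symptom data, including one missing value
sym <- data.frame(
  fever = c("yes", "no", "yes", "yes"),
  cough = c("yes", "yes", NA, "yes"),
  stringsAsFactors = FALSE
)

# Recode to 1/0; anything other than "yes" (including NA) becomes 0
binary <- as.data.frame(lapply(sym, function(x) as.integer(x %in% "yes")))

# Label each patient with the combination of symptoms present
combo <- apply(binary, 1, function(r) paste(names(binary)[r == 1], collapse = " & "))

# Frequency of each combination (the heights of the vertical bars)
combo_counts <- table(combo)
print(combo_counts)

# Frequency of each individual symptom (the horizontal bars)
print(colSums(binary))
```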
Heatmaps can be useful when tracking reporting metrics across many facilities/jurisdictions over time
For example, the image below shows % of weekdays that data was received from each facility, week-by-week:
Load packages
pacman::p_load(tidyverse, # data manipulation and visualization
rio, # importing data
OpenStreetMap, # optional - basemaps
aweek
)
Often in public health, an objective is to assess trends over time for many entities (facilities, jurisdictions, etc.). One way to visualize trends over time from many entities is a heatmap where the x-axis is time and the y-axis lists the entities.
To demonstrate this, we import this dataset of daily malaria case reports from 65 facilities.
The preparation will involve:
Below are the first 30 rows of these data:
The packages we will use are:
pacman::p_load(tidyverse, # ggplot and data manipulation
rio, # importing data
aweek) # manage weeks
The objective is to transform the daily reports (seen in previous tab) into weekly reports with a summary of performance - in this case the proportion of days per week that the facility reported any data for Spring District from April-May 2019.
To achieve this:
1. Filter the data as appropriate (by place and date).
2. Create a week column using date2week() from package aweek. The floor_day = argument means that dates are rounded into the week only (day of the week is not shown). The factor = argument converts the new column to a factor - important because all possible weeks within the date range are designated as levels, even if there is no data for them currently.
3. Group the data into “facility-weeks” with group_by().
4. summarize() creates new columns to calculate reporting performance for each “facility-week” (number of reports, total malaria cases, and the number and percent of days reporting).
5. Join the result (with right_join()) to a comprehensive list of all possible facility-week combinations, to make the dataset complete. The matrix of all possible combinations is created by applying expand() to those two columns of the dataframe as it is at that moment in the pipe chain (represented by “.”). Because a right_join() is used, all rows in the expand() dataframe are kept, and added to agg_weeks if necessary. These new rows appear with NA (missing) summarized values.
# Create weekly summary dataset
agg_weeks <- facility_count_data %>%
# filter the data as appropriate
filter(District == "Spring",
data_date < as.Date("2019-06-01")) %>%
# Create week column from data_date
mutate(week = aweek::date2week(data_date,
start_date = "Monday",
floor_day = TRUE,
factor = TRUE)) %>%
# Group into facility-weeks
group_by(location_name, week, .drop = F) %>%
# Create summary column on the grouped data
summarize(n_days = 7, # 7 days per week
n_reports = dplyr::n(), # number of reports received per week (could be >7)
malaria_tot = sum(malaria_tot, na.rm = T), # total malaria cases reported
n_days_reported = length(unique(data_date)), # number of unique days reporting per week
p_days_reported = round(100*(n_days_reported / n_days))) %>% # percent of days reporting
# Ensure every possible facility-week combination appears in the data
right_join(tidyr::expand(., week, location_name)) # "." represents the dataset at that moment in the pipe chain
The ggplot() is made using geom_tile():
The week on the x-axis is transformed to a date (via week2date()), which allows use of scale_x_date()
fill is the performance for that facility-week (numeric)
scale_fill_gradient() is used on the numeric fill, specifying colors for high, low, and NA
scale_x_date() is used on the x-axis specifying labels every 2 weeks and their format
ggplot(agg_weeks,
aes(x = aweek::week2date(week), # transformed to date class
y = location_name,
fill = p_days_reported))+
# tiles
geom_tile(colour="white")+ # white gridlines
scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
scale_x_date(expand = c(0,0),
date_breaks = "2 weeks",
date_labels = "%d\n%b")+
# aesthetic themes
theme_minimal()+ # simplify background
theme(
legend.title = element_text(size=12, face="bold"),
legend.text = element_text(size=10, face="bold"),
legend.key.height = grid::unit(1,"cm"), # height of legend key
legend.key.width = grid::unit(0.6,"cm"), # width of legend key
axis.text.x = element_text(size=12),
axis.text.y = element_text(vjust=0.2),
axis.ticks = element_line(size=0.4),
axis.title = element_text(size=12, face="bold"),
plot.title = element_text(hjust=0,size=14,face="bold"),
plot.caption = element_text(hjust = 0, face = "italic")
)+
# plot labels
labs(x = "Week",
y = "Facility name",
fill = "Reporting\nperformance (%)", # legend title
title = "Percent of days per week that facility reported data",
subtitle = "District health facilities, April-May 2019",
caption = "7-day weeks beginning on Mondays.")
If you want to order the y-axis facilities by something, convert them to class Factor and provide the order. Below, the order is set based on the total number of reporting days filed by the facility across the whole timespan:
facility_order <- agg_weeks %>%
group_by(location_name) %>%
summarize(tot_reports = sum(n_days_reported, na.rm=T)) %>%
arrange(tot_reports) # ascending order
as_tibble(facility_order)
## # A tibble: 15 x 2
## location_name tot_reports
## <chr> <int>
## 1 Facility 56 1
## 2 Facility 65 6
## 3 Facility 11 19
## 4 Facility 39 31
## 5 Facility 59 33
## 6 Facility 27 40
## 7 Facility 32 41
## 8 Facility 51 41
## 9 Facility 7 42
## 10 Facility 1 46
## 11 Facility 9 48
## 12 Facility 35 50
## 13 Facility 50 51
## 14 Facility 58 53
## 15 Facility 28 75
Now use the above vector (facility_order$location_name) to be the order of the factor levels of location_name in the dataset agg_weeks:
agg_weeks <- agg_weeks %>%
mutate(location_name = factor(location_name, levels = facility_order$location_name))
And now the data are re-plotted, with location_name being an ordered factor:
ggplot(agg_weeks,
aes(x = aweek::week2date(week), # transformed to date class
y = location_name,
fill = p_days_reported))+
# tiles
geom_tile(colour="white")+ # white gridlines
scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
scale_x_date(expand = c(0,0),
date_breaks = "2 weeks",
date_labels = "%d\n%b")+
# aesthetic themes
theme_minimal()+ # simplify background
theme(
legend.title = element_text(size=12, face="bold"),
legend.text = element_text(size=10, face="bold"),
legend.key.height = grid::unit(1,"cm"), # height of legend key
legend.key.width = grid::unit(0.6,"cm"), # width of legend key
axis.text.x = element_text(size=12),
axis.text.y = element_text(vjust=0.2),
axis.ticks = element_line(size=0.4),
axis.title = element_text(size=12, face="bold"),
plot.title = element_text(hjust=0,size=14,face="bold"),
plot.caption = element_text(hjust = 0, face = "italic")
)+
# plot labels
labs(x = "Week",
y = "Facility name",
fill = "Reporting\nperformance (%)", # legend title
title = "Percent of days per week that facility reported data",
subtitle = "District health facilities, April-May 2019",
caption = "7-day weeks beginning on Mondays.")
You can add a geom_text() layer on top of the tiles, to display the actual numbers of each tile. Be aware this may not look pretty if you have many small tiles!
The following layer is added: geom_text(aes(label = p_days_reported)) +. In the aesthetic aes() of geom_text(), the argument label (what to show) is set to the same numeric column used to create the color gradient.
ggplot(agg_weeks,
aes(x = aweek::week2date(week), # transformed to date class
y = location_name,
fill = p_days_reported))+
# tiles
geom_tile(colour="white")+ # white gridlines
geom_text(aes(label = p_days_reported))+ # add text on top of tile
scale_fill_gradient(low = "orange", high = "darkgreen", na.value = "grey80")+
scale_x_date(expand = c(0,0),
date_breaks = "2 weeks",
date_labels = "%d\n%b")+
# aesthetic themes
theme_minimal()+ # simplify background
theme(
legend.title = element_text(size=12, face="bold"),
legend.text = element_text(size=10, face="bold"),
legend.key.height = grid::unit(1,"cm"), # height of legend key
legend.key.width = grid::unit(0.6,"cm"), # width of legend key
axis.text.x = element_text(size=12),
axis.text.y = element_text(vjust=0.2),
axis.ticks = element_line(size=0.4),
axis.title = element_text(size=12, face="bold"),
plot.title = element_text(hjust=0,size=14,face="bold"),
plot.caption = element_text(hjust = 0, face = "italic")
)+
# plot labels
labs(x = "Week",
y = "Facility name",
fill = "Reporting\nperformance (%)", # legend title
title = "Percent of days per week that facility reported data",
subtitle = "District health facilities, April-May 2019",
caption = "7-day weeks beginning on Mondays.")
Contoured heatmap of cases over a basemap
The heatmap is built from the linelist using the latitude and longitude columns.
http://data-analytics.net/cep/Schedule_files/geospatial.html
pacman::p_load(OpenStreetMap)
# Fit basemap by range of lat/long coordinates. Choose tile type
map <- openmap(c(max(linelist$lat, na.rm=T), max(linelist$lon, na.rm=T)), # limits of tile
c(min(linelist$lat, na.rm=T), min(linelist$lon, na.rm=T)),
zoom = NULL,
type = c("osm", "stamen-toner", "stamen-terrain","stamen-watercolor", "esri","esri-topo")[1],
mergeTiles = TRUE)
# Projection WGS84
map.latlon <- openproj(map, projection = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# Plot map. Must be autoplotted to work with ggplot
OpenStreetMap::autoplot.OpenStreetMap(map.latlon)+
# Density tiles
ggplot2::stat_density_2d(aes(x = lon,
y = lat,
fill = ..level..,
alpha=..level..),
bins = 10,
geom = "polygon",
contour_var = "count",
data = linelist,
show.legend = F) +
scale_fill_gradient(low = "black", high = "red")+
labs(x = "Longitude",
y = "Latitude",
title = "Distribution of simulated cases")
https://www.rdocumentation.org/packages/OpenStreetMap/versions/0.3.4/topics/autoplot.OpenStreetMap
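Returning to the weekly reporting heatmap earlier on this page: the step that guarantees one row per facility-week (expand() followed by right_join()) can be sketched in base R with expand.grid() and merge(). The facilities, weeks and counts below are hypothetical, and this is only an illustration of the idea, not a replacement for the tidyverse pipeline.

```r
# Hypothetical weekly summary in which one facility-week is missing
agg <- data.frame(
  location_name   = c("Facility 1", "Facility 1", "Facility 2"),
  week            = c("2019-W14", "2019-W15", "2019-W14"),
  n_days_reported = c(5, 7, 3),
  stringsAsFactors = FALSE
)

# Percent of the 7 days in each week with a report
agg$p_days_reported <- round(100 * agg$n_days_reported / 7)

# All possible facility-week combinations
grid <- expand.grid(
  location_name = unique(agg$location_name),
  week          = unique(agg$week),
  stringsAsFactors = FALSE
)

# Keep every row of the grid; combinations without data get NA
complete <- merge(agg, grid, by = c("location_name", "week"), all.y = TRUE)
print(complete)
```

The NA row is what appears as a grey tile when na.value = "grey80" is set in scale_fill_gradient().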
The primary tool to handle, analyse and visualise transmission chains and contact tracing data is the package epicontacts, developed by the folks at RECON. Try out the interactive plot below by hovering over the nodes for more information, dragging them to move them and clicking on them to highlight downstream cases.
First load the standard packages required for data import and manipulation.
pacman::p_load(
rio, # File import
here, # File locator
tidyverse, # Data management + ggplot2 graphics
remotes # Package installation from github
)
You will require the development version of epicontacts, which can be installed from github using the remotes package. You only need to run the code below once, not every time you use the package.
remotes::install_github("reconhub/epicontacts@timeline")
Next, import the standard, cleaned linelist for this analysis.
# import the cleaned linelist
linelist <- rio::import("linelist_cleaned.xlsx")
We then need to create an epicontacts object, which requires two types of data:
As we already have a linelist, we just need to create a list of edges between
cases, more specifically between their IDs. We can extract transmission links from the
linelist by linking the infector column with the case_id column. At this point we can also add “edge
properties”, by which we mean any variable describing the link between the two
cases, not the cases themselves. For illustration, we will add a location
variable describing the location of the transmission event, and a duration
variable describing the duration of the contact in days.
In the code below, the dplyr function transmute is similar to mutate, except it only keeps
the columns we have specified within the function. The drop_na function will
filter out any rows where the specified columns have an NA value; in this
case, we only want to keep the rows where the infector is known.
## generate contacts
contacts <- linelist %>%
transmute(
infector = infector,
case_id = case_id,
location = sample(c("Community", "Nosocomial"), n(), TRUE),
duration = sample.int(10, n(), TRUE)
) %>%
drop_na(infector)
We can now create the epicontacts object using the make_epicontacts
function. We need to specify which column in the linelist points to the unique case
identifier, as well as which columns in the contacts point to the unique
identifiers of the cases involved in each link. These links are directional in
that infection is going from the infector to the case, so we need to specify
the from and to arguments accordingly. We therefore also set the directed
argument to TRUE, which will affect future operations.
## generate epicontacts object
epic <- make_epicontacts(
linelist = linelist,
contacts = contacts,
id = "case_id",
from = "infector",
to = "case_id",
directed = TRUE
)
Upon examining the epicontacts object, we can see that the case_id column
in the linelist has been renamed to id and the infector and case_id
columns in the contacts have been renamed to from and to. This ensures
consistency in subsequent handling, visualisation and analysis operations.
## view epicontacts object
epic
##
## /// Epidemiological Contacts //
##
## // class: epicontacts
## // 5,888 cases in linelist; 3,800 contacts; directed
##
## // linelist
##
## # A tibble: 5,888 x 30
## id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat
## <chr> <dbl> <date> <date> <date> <date> <chr> <chr> <dbl> <chr> <dbl> <fct> <fct> <chr> <dbl> <dbl>
## 1 d8a1~ 4 2014-05-06 2014-05-08 2014-05-10 NA <NA> f 3 years 3 0-4 0-4 St. Mar~ -13.2 8.46
## 2 8689~ 4 NA 2014-05-13 2014-05-14 2014-05-18 Recover f 7 years 7 5-9 5-9 Missing -13.2 8.45
## 3 11f8~ 2 NA 2014-05-16 2014-05-18 2014-05-30 Recover m 21 years 21 20-29 20-24 St. Mar~ -13.2 8.46
## 4 dae8~ 3 2014-05-23 NA 2014-05-27 2014-05-30 Death f 4 years 4 0-4 0-4 Port Ho~ -13.2 8.45
## 5 acf4~ 6 2014-05-25 2014-05-27 2014-05-28 2014-06-27 Recover m 4 years 4 0-4 0-4 Central~ -13.3 8.48
## 6 1a4a~ 6 NA 2014-05-27 2014-05-29 2014-06-07 Death m 30 years 30 30-49 30-34 Port Ho~ -13.3 8.45
## 7 275c~ 5 2014-05-24 2014-05-27 2014-05-28 2014-06-07 Death f 13 years 13 10-14 10-14 Central~ -13.2 8.47
## 8 1389~ 4 NA 2014-06-05 2014-06-07 2014-06-09 Death f 2 years 2 0-4 0-4 Missing -13.3 8.47
## 9 057e~ 7 2014-06-04 2014-06-14 2014-06-15 NA Recover f 4 years 4 0-4 0-4 Missing -13.2 8.47
## 10 c97d~ 9 NA NA 2014-06-19 2014-07-11 Recover m 22 years 22 20-29 20-24 Port Ho~ -13.2 8.47
## # ... with 5,878 more rows, and 14 more variables: infector <chr>, source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>,
## # cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
##
## // contacts
##
## # A tibble: 3,800 x 4
## from to location duration
## <chr> <chr> <chr> <int>
## 1 20b688 d8a13d Community 9
## 2 be7f8a dae8c7 Community 8
## 3 1511c5 acf422 Nosocomial 5
## 4 e02f66 275cc7 Nosocomial 2
## 5 cbbe78 057e7a Nosocomial 4
## 6 e61cb9 02d8fd Nosocomial 10
## 7 057e7a c36eb4 Nosocomial 5
## 8 4977bd 542d07 Nosocomial 6
## 9 a75c7f 7f5a01 Community 6
## 10 ea3740 b799eb Nosocomial 5
## # ... with 3,790 more rows
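The contacts table above is, in effect, an edge list: one row per known infector-case pair. A minimal base R sketch of the same idea on a made-up four-case linelist (all IDs are hypothetical):

```r
# Hypothetical linelist: the first case has no known infector
linelist_toy <- data.frame(
  case_id  = c("a1", "b2", "c3", "d4"),
  infector = c(NA, "a1", "a1", "b2"),
  stringsAsFactors = FALSE
)

# Infection flows from the infector ("from") to the case ("to")
contacts_toy <- data.frame(
  from = linelist_toy$infector,
  to   = linelist_toy$case_id,
  stringsAsFactors = FALSE
)

# Keep only links where the infector is known
contacts_toy <- contacts_toy[!is.na(contacts_toy$from), ]
print(contacts_toy)

# Number of secondary cases attributed to each infector
secondary <- table(contacts_toy$from)
print(secondary)
```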
The subset() method for epicontacts objects allows for, among other things,
filtering of networks based on properties of the linelist (“node attributes”) and the contacts
database (“edge attributes”). These values must be passed as named lists to the
respective argument. For example, in the code below we are keeping only the
male cases in the linelist that have an infection date between April and
July 2014 (dates are specified as ranges), and transmission links that occurred
in the hospital.
sub_attributes <- subset(
epic,
node_attribute = list(
gender = "m",
date_infection = as.Date(c("2014-04-01", "2014-07-01"))
),
edge_attribute = list(location = "Nosocomial")
)
sub_attributes
##
## /// Epidemiological Contacts //
##
## // class: epicontacts
## // 67 cases in linelist; 1,880 contacts; directed
##
## // linelist
##
## # A tibble: 67 x 30
## id generation date_infection date_onset date_hospitalis~ date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat
## <chr> <dbl> <date> <date> <date> <date> <chr> <chr> <dbl> <chr> <dbl> <fct> <fct> <chr> <dbl> <dbl>
## 1 acf4~ 6 2014-05-25 2014-05-27 2014-05-28 2014-06-27 Recover m 4 years 4 0-4 0-4 Central~ -13.3 8.48
## 2 02d8~ 9 2014-06-14 2014-06-20 2014-06-20 2014-07-01 Death m 20 years 20 20-29 20-24 Port Ho~ -13.2 8.45
## 3 c36e~ 8 2014-06-15 2014-06-20 2014-06-21 2014-06-24 Death m 16 years 16 15-19 15-19 Port Ho~ -13.3 8.48
## 4 542d~ 5 2014-06-10 NA 2014-06-23 2014-06-28 Death m 16 years 16 15-19 15-19 Port Ho~ -13.3 8.46
## 5 b799~ 5 2014-06-27 2014-07-03 2014-07-05 2014-07-12 Recover m 14 years 14 10-14 10-14 Missing -13.2 8.46
## 6 3057~ 6 2014-05-17 2014-05-27 2014-06-07 2014-06-07 <NA> m 9 years 9 5-9 5-9 Port Ho~ -13.2 8.47
## 7 e857~ 5 2014-05-27 2014-06-03 2014-06-08 2014-06-02 Death m 21 years 21 20-29 20-24 Missing -13.2 8.48
## 8 d330~ 9 2014-06-27 2014-07-04 2014-07-09 2014-07-10 Recover m 10 years 10 10-14 10-14 St. Mar~ -13.2 8.47
## 9 a3c8~ 4 2014-05-07 2014-05-08 2014-05-10 2014-05-14 Recover m 13 years 13 10-14 10-14 Port Ho~ -13.2 8.47
## 10 72b9~ 4 2014-05-06 NA 2014-05-13 NA Recover m 17 years 17 15-19 15-19 Other -13.2 8.47
## # ... with 57 more rows, and 14 more variables: infector <chr>, source <chr>, wt_kg <dbl>, ht_cm <dbl>, ct_blood <dbl>, fever <chr>, chills <chr>,
## # cough <chr>, aches <chr>, vomit <chr>, temp <dbl>, time_admission <chr>, bmi <dbl>, days_onset_hosp <dbl>
##
## // contacts
##
## # A tibble: 1,880 x 4
## from to location duration
## <chr> <chr> <chr> <int>
## 1 1511c5 acf422 Nosocomial 5
## 2 e02f66 275cc7 Nosocomial 2
## 3 cbbe78 057e7a Nosocomial 4
## 4 e61cb9 02d8fd Nosocomial 10
## 5 057e7a c36eb4 Nosocomial 5
## 6 4977bd 542d07 Nosocomial 6
## 7 ea3740 b799eb Nosocomial 5
## 8 36e2e7 6d788e Nosocomial 3
## 9 7baf73 67be4e Nosocomial 7
## 10 3b096b 3789ee Nosocomial 10
## # ... with 1,870 more rows
We can use the thin function to either filter the linelist to include cases
that are found in the contacts by setting the argument what = "linelist", or
filter the contacts to include cases that are found in the linelist by setting
the argument what = "contacts". In the code below, we are further filtering the
epicontacts object to keep only the transmission links involving the male cases
infected between April and July which we had filtered for above. We can see that
only three known transmission links fit that specification.
sub_attributes <- thin(sub_attributes, what = "contacts")
nrow(sub_attributes$contacts)
## [1] 3
In addition to subsetting by node and edge attributes, networks can be pruned to
only include components that are connected to certain nodes. The cluster_id
argument takes a vector of case IDs and returns the linelist of individuals that
are linked, directly or indirectly, to those IDs. In the code below, we can see
that a total of 13 linelist cases are involved in the clusters containing
2ae019 and 71577a.
sub_id <- subset(epic, cluster_id = c("2ae019","71577a"))
nrow(sub_id$linelist)
## [1] 13
The subset() method for epicontacts objects also allows filtering by cluster
size using the cs, cs_min and cs_max arguments. In the code below, we are
keeping only the cases linked to clusters of 10 cases or larger, and can see that
271 linelist cases are involved in such clusters.
sub_cs <- subset(epic, cs_min = 10)
nrow(sub_cs$linelist)
## [1] 271
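Both cluster-based filters rest on the idea of a connected component: the set of cases reachable from a given case by following links in either direction. The base R sketch below illustrates that idea with a simple breadth-first search on a made-up edge list. The helper find_cluster() is hypothetical and written only for this illustration; it is not how epicontacts implements the feature.

```r
# Hypothetical edge list with three separate clusters
edges <- data.frame(
  from = c("a", "b", "d", "f"),
  to   = c("b", "c", "e", "g"),
  stringsAsFactors = FALSE
)
nodes <- unique(c(edges$from, edges$to))

# Hypothetical helper: all nodes linked, directly or indirectly, to `start`
find_cluster <- function(start, edges) {
  visited <- character(0)
  queue <- start
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current %in% visited) next
    visited <- c(visited, current)
    # Neighbours in either direction (links treated as undirected)
    neighbours <- c(edges$to[edges$from == current],
                    edges$from[edges$to == current])
    queue <- c(queue, setdiff(neighbours, visited))
  }
  visited
}

# Analogous to filtering by cluster membership
cluster_a <- find_cluster("a", edges)
print(cluster_a)

# Analogous to filtering by minimum cluster size
sizes <- sapply(nodes, function(n) length(find_cluster(n, edges)))
big <- names(sizes)[sizes >= 3]
print(big)
```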
The get_id() function retrieves information on case IDs in the
dataset, and can be parameterized as follows:
For example, what are the first ten IDs in the contacts dataset?
contacts_ids <- get_id(epic, "contacts")
head(contacts_ids, n = 10)
## [1] "20b688" "be7f8a" "1511c5" "e02f66" "cbbe78" "e61cb9" "057e7a" "4977bd" "a75c7f" "ea3740"
How many IDs are found in both the linelist and the contacts?
length(get_id(epic, "common"))
## [1] 4352
All visualisations of epicontacts objects are handled by the plot
function. We will first filter the epicontacts object to include only the
cases with onset dates in June 2014 using the subset function, and only
include the contacts linked to those cases using the thin function.
## subset epicontacts object
sub <- epic %>%
subset(
node_attribute = list(date_onset = as.Date(c("2014-06-01", "2014-06-30")))
) %>%
thin("contacts")
We can then create the basic, interactive plot very simply as follows:
## plot epicontacts object
plot(
sub,
width = 700,
height = 700
)
You can move the nodes around by dragging them, hover over them for more information and click on them to highlight connected cases.
There are a large number of arguments to further modify this plot. We will cover
the main ones here, but check out the documentation via ?vis_epicontacts (the
function called when using plot on an epicontacts object) to get a full
description of the function arguments.
Node color, node shape and node size can be mapped to a given column in the linelist
using the node_color, node_shape and node_size arguments. This is similar
to the aes syntax you may recognise from ggplot2.
The specific colors, shapes and sizes of nodes can be specified as follows:
Colors via the col_pal argument, either by providing a named list for manual
specification of each color as done below, or by providing a color palette
function such as colorRampPalette(c("black", "red", "orange")), which would
provide a gradient of colours between the ones specified.
Shapes by passing a named list to the shapes argument, specifying one shape
for each unique element in the linelist column specified by the node_shape
argument. See codeawesome for available shapes.
Size by passing a size range of the nodes to the size_range argument.
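As a small illustration of the palette-function option, colorRampPalette() from base R (package grDevices) returns a function that generates any requested number of interpolated colors:

```r
# Build a palette function that interpolates between three colors
pal <- grDevices::colorRampPalette(c("black", "red", "orange"))

# Ask the palette function for five colors along the gradient
cols <- pal(5)
print(cols)
```

The result is a character vector of hex color codes, starting at black and ending at orange.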
Here is an example, where color represents the outcome, shape the gender and size the age:
plot(
sub,
node_color = "outcome",
node_shape = "gender",
node_size = 'age',
col_pal = c(Death = "firebrick", Recover = "green"),
shapes = c(f = "female", m = "male"),
size_range = c(40, 60),
height = 700,
width = 700
)
Edge color, width and linetype can be mapped to a given column in the contacts
dataframe using the edge_color, edge_width and edge_linetype
arguments. The specific colors and widths of the edges can be specified as follows:
Colors via the edge_col_pal argument, in the same manner used for col_pal.
Widths by passing a size range of the edges to the width_range argument.
Here is an example:
plot(
sub,
node_color = "outcome",
node_shape = "gender",
node_size = 'age',
col_pal = c(Death = "firebrick", Recover = "green"),
shapes = c(f = "female", m = "male"),
size_range = c(40, 60),
edge_color = 'location',
edge_linetype = 'location',
edge_width = 'duration',
edge_col_pal = c(Community = "orange", Nosocomial = "purple"),
width_range = c(1, 3),
height = 700,
width = 700
)
We can also visualise the network along a temporal axis by mapping the x_axis
argument to a column in the linelist. In the example below, the x-axis
represents the date of symptom onset. We have also specified the arrow_size
argument to ensure the arrows are not too large, and set label = FALSE to make
the figure less cluttered.
plot(
sub,
x_axis = "date_onset",
node_color = "outcome",
col_pal = c(Death = "firebrick", Recover = "green"),
arrow_size = 0.5,
node_size = 13,
label = FALSE,
height = 700,
width = 700
)
There are a large number of additional arguments to further specify how this
network is visualised along a temporal axis, which you can check out
via ?vis_temporal_interactive (the function called when using plot on
an epicontacts object with x_axis specified). We’ll go through some
below.
There are two main shapes that the transmission tree can assume, specified using
the network_shape argument. The first is a branching shape as shown above,
where a straight edge connects any two nodes. This is the most intuitive
representation, but it can result in overlapping edges in a densely connected
network. The second shape is rectangle, which produces a tree resembling a
phylogeny. For example:
plot(
sub,
x_axis = "date_onset",
network_shape = "rectangle",
node_color = "outcome",
col_pal = c(Death = "firebrick", Recover = "green"),
arrow_size = 0.5,
node_size = 13,
label = FALSE,
height = 700,
width = 700
)
Each case node can be assigned a unique vertical position by toggling the
position_dodge argument. The position of unconnected cases (i.e. with no
reported contacts) is specified using the unlinked_pos argument.
plot(
sub,
x_axis = "date_onset",
network_shape = "rectangle",
node_color = "outcome",
col_pal = c(Death = "firebrick", Recover = "green"),
position_dodge = TRUE,
unlinked_pos = "bottom",
arrow_size = 0.5,
node_size = 13,
label = FALSE,
height = 700,
width = 700
)
The position of the parent node relative to the children nodes can be
specified using the parent_pos argument. The default option is to place the
parent node in the middle, however it can be placed at the bottom (parent_pos = 'bottom') or at the top (parent_pos = 'top').
plot(
sub,
x_axis = "date_onset",
network_shape = "rectangle",
node_color = "outcome",
col_pal = c(Death = "firebrick", Recover = "green"),
parent_pos = "top",
arrow_size = 0.5,
node_size = 13,
label = FALSE,
height = 700,
width = 700
)
You can save a plot as an interactive, self-contained html file with the
visSave function from the visNetwork package:
plot(
sub,
x_axis = "date_onset",
network_shape = "rectangle",
node_color = "outcome",
col_pal = c(Death = "firebrick", Recover = "green"),
parent_pos = "top",
arrow_size = 0.5,
node_size = 13,
label = FALSE,
height = 700,
width = 700
) %>%
visNetwork::visSave("network.html")
Saving these network outputs as an image is unfortunately less straightforward: you
first save the plot as an html file and then take a screenshot of this file using
the webshot package. In the code below, we convert the html file saved
above into a PNG:
webshot::webshot(url = "network.html", file = "network.png")
You can also add case timelines to the network, which are represented on the x-axis of each case. This can be used to visualise case locations, for example, or time to outcome. To generate a timeline, we need to create a data.frame of at least three columns indicating the case ID, the start date of the “event” and the end date of the “event”. You can also add any number of other columns, which can then be mapped to node and edge properties of the timeline. In the code below, we generate a timeline going from the date of symptom onset to the date of outcome, and keep the outcome and hospital variables, which we use to define the node shape and colour. Note that you can have more than one timeline row/event per case, for example if a case is transferred between multiple hospitals.
## generate timeline
timeline <- linelist %>%
transmute(
id = case_id,
start = date_onset,
end = date_outcome,
outcome = outcome,
hospital = hospital
)
We then pass the timeline element to the timeline argument. We can map
timeline attributes to timeline node colours, shapes and sizes in the same way
defined in previous sections, except that we have two nodes: the start and end
node of each timeline, which have separate arguments. For example,
tl_start_node_color defines which timeline column is mapped to the colour of
the start node, while tl_end_node_shape defines which timeline column is
mapped to the shape of the end node. We can also map colour, width, linetype and
labels to the timeline edge via the tl_edge_* arguments.
See ?vis_temporal_interactive (the function called when plotting an
epicontacts object) for detailed documentation on the arguments. Each argument
is annotated in the code below too:
## define shapes
shapes <- c(
f = "female",
m = "male",
Death = "user-times",
Recover = "heartbeat",
"NA" = "question-circle"
)
## define colours
colours <- c(
Death = "firebrick",
Recover = "green",
"NA" = "grey"
)
## make plot
plot(
sub,
## map x coordinate to date of onset
x_axis = "date_onset",
## use rectangular network shape
network_shape = "rectangle",
## map case node shapes to gender column
node_shape = "gender",
## we don't want to map node colour to any columns - this is important as the
## default value is to map to node id, which will mess up the colour scheme
node_color = NULL,
## set case node size to 30 (as this is not a character, node_size is not
## mapped to a column but instead interpreted as the actual node size)
node_size = 30,
## set transmission link width to 4 (as this is not a character, edge_width is
## not mapped to a column but instead interpreted as the actual edge width)
edge_width = 4,
## provide the timeline object
timeline = timeline,
## map the shape of the end node to the outcome column in the timeline object
tl_end_node_shape = "outcome",
## set the size of the end node to 15 (as this is not a character, this
## argument is not mapped to a column but instead interpreted as the actual
## node size)
tl_end_node_size = 15,
## map the colour of the timeline edge to the hospital column
tl_edge_color = "hospital",
## set the width of the timeline edge to 2 (as this is not a character, this
## argument is not mapped to a column but instead interpreted as the actual
## edge width)
tl_edge_width = 2,
## map edge labels to the hospital variable
tl_edge_label = "hospital",
## specify the shape for every node attribute (defined above)
shapes = shapes,
## specify the colour palette (defined above)
col_pal = colours,
## set the size of the arrow to 0.5
arrow_size = 0.5,
## use two columns in the legend
legend_ncol = 2,
## set font size
font_size = 15,
## define formatting for dates
date_labels = c("%d %b %Y"),
## don't plot the ID labels below nodes
label = FALSE,
## specify height
height = 1000,
## specify width
width = 1200,
## ensure each case node has a unique y-coordinate - this is very important
## when using timelines, otherwise you will have overlapping timelines from
## different cases
position_dodge = TRUE
)
## Warning in assert_timeline(timeline, x, x_axis): 5865 timeline row(s) removed as ID not found in linelist or start/end date is NA
We can get an overview of some of the network properties using the
summary function.
## summarise epicontacts object
summary(epic)
##
## /// Overview //
## // number of unique IDs in linelist: 5888
## // number of unique IDs in contacts: 5511
## // number of unique IDs in both: 4352
## // number of contacts: 3800
## // contacts with both cases in linelist: 56.868 %
##
## /// Degrees of the network //
## // in-degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000000000 0.0000000000000 1.0000000000000 0.5392365545622 1.0000000000000 1.0000000000000
##
## // out-degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000000000 0.0000000000000 0.0000000000000 0.5392365545622 1.0000000000000 6.0000000000000
##
## // in and out degree summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000000000 1.000000000000 1.000000000000 1.078473109124 1.000000000000 7.000000000000
##
## /// Attributes //
## // attributes in linelist:
## generation date_infection date_onset date_hospitalisation date_outcome outcome gender age age_unit age_years age_cat age_cat5 hospital lon lat infector source wt_kg ht_cm ct_blood fever chills cough aches vomit temp time_admission bmi days_onset_hosp
##
## // attributes in contacts:
## location duration
For example, we can see that only 57% of contacts have both cases in the linelist; this means that we do not have linelist data on a significant number of cases involved in these transmission chains.
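To make the "contacts with both cases in linelist" figure concrete, the base-R sketch below recomputes that kind of percentage on a small toy dataset. The IDs and object names are hypothetical and are not taken from the handbook's data; the calculation shown is an illustration of the idea, not necessarily how summary() computes it internally.

```r
# toy data: hypothetical case IDs and contact pairs
linelist_ids <- c("a", "b", "c")
contacts_toy <- data.frame(
  from = c("a", "b", "c", "x"),
  to   = c("b", "c", "y", "a"),
  stringsAsFactors = FALSE
)

# TRUE where both ends of the contact appear in the linelist
both_in_linelist <- contacts_toy$from %in% linelist_ids &
  contacts_toy$to %in% linelist_ids

# percentage of contacts with both cases in the linelist
pct_both <- 100 * mean(both_in_linelist)
pct_both
```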
The get_pairwise() function allows processing of variable(s) in the line list
according to each pair in the contact dataset. For the following example, date
of onset of disease is extracted from the line list in order to compute the
difference between disease date of onset for each pair. The value that is
produced from this comparison represents the serial interval (si).
si <- get_pairwise(epic, "date_onset")
summary(si)
##            Min.         1st Qu.          Median            Mean         3rd Qu.            Max.            NA's
## 0.0000000000 5.0000000000 9.0000000000 10.9128256513 15.0000000000 99.0000000000 1804
tibble(si = si) %>%
ggplot(aes(si)) +
geom_histogram() +
labs(
x = "Serial interval",
y = "Frequency"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1804 rows containing non-finite values (stat_bin).
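The pairwise calculation can be sketched in base R on toy data. The data frames below are hypothetical, and the sketch assumes the difference is taken as an absolute number of days, which is consistent with the non-negative summary shown above; a missing value results whenever one case of the pair is absent from the line list.

```r
# toy data: hypothetical line list and contact pairs
linelist_toy <- data.frame(
  id = c("a", "b", "c"),
  date_onset = as.Date(c("2014-06-01", "2014-06-08", "2014-06-12")),
  stringsAsFactors = FALSE
)
contacts_toy <- data.frame(
  from = c("a", "b", "a"),
  to   = c("b", "c", "z"),   # "z" is not in the line list
  stringsAsFactors = FALSE
)

# look up the onset date of each end of every contact pair
onset_from <- linelist_toy$date_onset[match(contacts_toy$from, linelist_toy$id)]
onset_to   <- linelist_toy$date_onset[match(contacts_toy$to, linelist_toy$id)]

# serial interval in days (absolute difference; NA if either date is unknown)
si_toy <- abs(as.integer(onset_to - onset_from))
si_toy
```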
The get_pairwise() function will interpret the class of the column being used for
comparison, and will adjust its method of comparing the values accordingly. For
numbers and dates (like the si example above), the function will subtract
the values. When applied to columns that are characters or categorical,
get_pairwise() will paste values together. Because the function also allows
for arbitrary processing (see “f” argument), these discrete combinations can be
easily tabulated and analyzed.
head(get_pairwise(epic, "gender"), n = 10)
## [1] "m -> f" NA "f -> m" "f -> f" NA "f -> m" "f -> m" NA NA NA
get_pairwise(epic, "gender", f = table)
##            values.to
## values.from f m
## f 470 518
## m 519 436
fisher.test(get_pairwise(epic, "gender", f = table))
##
## Fisher's Exact Test for Count Data
##
## data: get_pairwise(epic, "gender", f = table)
## p-value = 0.003167711360736
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.6351504904440458 0.9147305045225598
## sample estimates:
## odds ratio
## 0.7623492175473596
Here, we see a statistically significant association (p ≈ 0.003) between the genders of the two cases in a transmission pair.
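As a rough illustration of the behaviour for categorical columns, the base-R sketch below builds "from -> to" labels and a cross-tabulation on a few hypothetical pairs, then re-runs Fisher's exact test on the counts printed above, entered by hand. The vectors and object names are invented for illustration; only the four counts come from the output shown.

```r
# toy data: gender of the "from" and "to" case of five hypothetical pairs
gender_from <- c("m", "f", "f", NA, "f")
gender_to   <- c("f", "m", "f", "m", "m")

# paste the two values together, keeping NA when either side is missing
pair_label <- ifelse(
  is.na(gender_from) | is.na(gender_to),
  NA,
  paste(gender_from, gender_to, sep = " -> ")
)

# cross-tabulate the pairs (pairs with a missing value are dropped)
toy_tab <- table(values.from = gender_from, values.to = gender_to)

# the 2x2 table printed above, entered by hand, and the same test
printed_tab <- matrix(
  c(470, 519, 518, 436),
  nrow = 2,
  dimnames = list(values.from = c("f", "m"), values.to = c("f", "m"))
)
ft <- fisher.test(printed_tab)
ft$p.value
```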
The get_clusters() function can be used to identify connected components
in an epicontacts object. First, we use it to retrieve a data.frame
containing the cluster information:
clust <- get_clusters(epic, output = "data.frame")
table(clust$cluster_size)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1536 1680 1182 784 545 342 308 208 171 100 99 24 26 42
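For readers unfamiliar with connected components, the base-R sketch below finds them on a tiny hypothetical network by repeatedly spreading the smallest label along each edge. This is only meant to illustrate the concept on toy data; it is not a description of how get_clusters() is implemented.

```r
# toy data: hypothetical case IDs and contact pairs (not the handbook's data)
nodes <- c("a", "b", "c", "d", "e", "f")
edges <- data.frame(
  from = c("a", "b", "d"),
  to   = c("b", "c", "e"),
  stringsAsFactors = FALSE
)

# start with every node in its own cluster, labelled by its position
cluster <- setNames(seq_along(nodes), nodes)

# repeatedly give both ends of each edge the smaller of their two labels,
# until a full pass over the edges changes nothing
repeat {
  changed <- FALSE
  for (i in seq_len(nrow(edges))) {
    pair   <- c(edges$from[i], edges$to[i])
    lowest <- min(cluster[pair])
    if (any(cluster[pair] != lowest)) {
      cluster[pair] <- lowest
      changed <- TRUE
    }
  }
  if (!changed) break
}

# size of the cluster each node belongs to
cluster_size <- as.integer(table(cluster)[as.character(cluster)])
data.frame(id = nodes, cluster = unname(cluster), cluster_size = cluster_size)
```

In this toy network, a, b and c form one cluster of size 3, d and e a cluster of size 2, and f is a singleton.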
ggplot(clust, aes(cluster_size)) +
geom_bar() +
labs(
x = "Cluster size",
y = "Frequency"
)
Let us look at the largest clusters. For this, we add cluster information to the
epicontacts object and then subset it to keep only the largest clusters:
epic <- get_clusters(epic)
max_size <- max(epic$linelist$cluster_size)
plot(subset(epic, cs = max_size))
The degree of a node corresponds to its number of edges or connections to other
nodes. get_degree() provides an easy method for calculating this value for
epicontacts networks. A high degree in this context indicates an individual
who was in contact with many others. The type argument indicates that we want
to count both the in-degree and out-degree, while the only_linelist argument
indicates that we only want to calculate the degree for cases in the linelist.
deg_both <- get_degree(epic, type = "both", only_linelist = TRUE)
Which individuals have the ten most contacts?
head(sort(deg_both, decreasing = TRUE), 10)
## 916d0a 858426 6833d7 f093ea 11f8ea 02d8fd a2b371 38fc71 07c4ad 72268f
## 7 6 6 6 5 5 5 5 5 5
What is the mean number of contacts?
mean(deg_both)
## [1] 1.07847310912445
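The notions of in-degree, out-degree and combined degree can be illustrated in base R by counting how often each ID appears in the "from" and "to" columns of a contact list. The IDs below are hypothetical, and restricting the count to a fixed set of IDs is used here as a stand-in for the only_linelist behaviour described above.

```r
# toy data: hypothetical linelist IDs and directed contacts
ids <- c("a", "b", "c", "d", "e")
contacts_toy <- data.frame(
  from = c("a", "a", "a", "b"),
  to   = c("b", "c", "d", "c"),
  stringsAsFactors = FALSE
)

# count appearances per ID; factor levels keep IDs with zero contacts
out_deg <- table(factor(contacts_toy$from, levels = ids))
in_deg  <- table(factor(contacts_toy$to, levels = ids))

# combined degree as a named integer vector
deg_toy <- setNames(as.integer(out_deg + in_deg), ids)

sort(deg_toy, decreasing = TRUE)  # most connected IDs first
mean(deg_toy)                     # mean number of contacts per case
```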
The epicontacts page provides an overview of the package functions and includes some more in-depth vignettes.
The github page can be used to raise issues and request features.
Phylogenetic trees are used to visualize and describe the relatedness and evolution of organisms based on the sequence of their genetic code. They can be constructed from genetic sequences using distance-based methods (such as the neighbor-joining method) or character-based methods (such as maximum likelihood and Bayesian Markov Chain Monte Carlo methods). Next-generation sequencing (NGS) has become more affordable and is increasingly used in public health to characterise pathogens causing infectious diseases. Portable sequencing devices decrease the turnaround time and make data available to support outbreak investigations in real time. NGS data can be used to identify the origin or source of an outbreak strain and its propagation, as well as to determine the presence of antimicrobial resistance genes. To visualize the genetic relatedness between samples, a phylogenetic tree is constructed. In this page we will learn how to use the ggtree package, which allows phylogenetic trees to be combined with additional sample data in the form of a dataframe, in order to help observe patterns and improve understanding of the outbreak dynamics.
This code chunk shows the loading of required packages:
# First we load the pacman package:
library(pacman)
# This allows us to load multiple packages at the same time in one line of code:
pacman::p_load(here, ggplot2, dplyr, ape, ggtree, treeio, ggnewscale)
There are several different formats in which a phylogenetic tree can be stored (e.g. Newick, NEXUS, Phylip). A common one, which we will also use in this example, is the Newick file format (.nwk), a standard for representing trees in computer-readable form. This means that an entire tree can be expressed as a string such as “((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);”, listing all nodes and tips and their relationship (branch length) to each other.
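To see how such a string maps onto a tree object, the short sketch below parses the example Newick string quoted above with ape::read.tree(), using its text argument instead of a file. The object names are illustrative. It also shows the node numbering convention of "phylo" objects: tips are numbered first, and internal nodes follow.

```r
library(ape)

# the example Newick string from the text above
newick <- "((t2:0.04,t1:0.34):0.89,(t5:0.37,(t4:0.03,t3:0.67):0.9):0.59);"

# parse the string directly rather than reading a .nwk file
toy_tree <- read.tree(text = newick)

n_tips     <- Ntip(toy_tree)   # number of tips (samples)
n_internal <- Nnode(toy_tree)  # number of internal nodes

# tips are numbered 1..n_tips; internal nodes continue from n_tips + 1
internal_ids <- (n_tips + 1):(n_tips + n_internal)

toy_tree$tip.label
internal_ids
```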
It is important to understand that the phylogenetic tree file in itself does not contain sequencing data, but is merely the result of the distances between the sequences. We therefore cannot extract sequencing data from a tree file.
We use the ape package to import a phylogenetic tree file and store it in a list object of class “phylo”. We inspect our tree object and see it contains 299 tips (or samples) and 236 internal nodes.
# read in the tree: we use the here package to specify the location of our R project and data files:
tree <- ape::read.tree(here::here("data", "Shigella_tree.nwk"))
# inspect the tree file:
tree
##
## Phylogenetic tree with 299 tips and 236 internal nodes.
##
## Tip labels:
## SRR5006072, SRR4192106, S18BD07865, S18BD00489, S17BD08906, S17BD05939, ...
## Node labels:
## 17, 29, 100, 67, 100, 100, ...
##
## Rooted; includes branch lengths.
Second, we import a table with additional information for each sequenced sample, such as gender, country of origin and attributes for antimicrobial resistance:
# We read in a csv file into a dataframe format:
sample_data <- read.csv(here::here("data", "sample_data_Shigella_tree.csv"), sep = ",", na.strings = c("NA"), header = TRUE, stringsAsFactors = FALSE)
We clean and inspect our data. In order to assign the correct sample data to the phylogenetic tree, the Sample_IDs in the sample_data file need to match the tip.labels in the tree file:
# We clean the data: we select certain columns to be protected from cleaning in order to maintain their formatting (e.g. for the sample names, as they have to match the names in the phylogenetic tree file)
#sample_data <- linelist::clean_data(sample_data, protect = c(1, 3:5))
# We check the formatting of the tip labels in the tree file:
head(tree$tip.label) # these are the sample names in the tree - we inspect the first 6 with head()
## [1] "SRR5006072" "SRR4192106" "S18BD07865" "S18BD00489" "S17BD08906" "S17BD05939"
# We make sure the first column in our dataframe are the Sample_IDs:
colnames(sample_data)
## [1] "Sample_ID" "serotype" "Country" "Continent" "Travel_history"
## [6] "Year" "Patient_age" "Source" "Gender" "gyrA_mutations"
## [11] "macrolide_resistance_genes" "ESBL" "MIC_AZM" "MIC_CIP"
# We look at the Sample_IDs in the dataframe to make sure the formatting is the same as in the tip.labels (e.g. letters are all capital, no extra _ between letters and numbers etc.)
head(sample_data$Sample_ID) # we inspect only the first 6 using head()
## [1] "ERR025692" "ERR025682" "ERR025714" "ERR025713" "ERR025709" "ERR025711"
Upon inspection we can see that the format of Sample_ID in the dataframe corresponds to the format of sample names at the tree tips. These do not have to be sorted in the same order to be matched.
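Visual inspection of the first few values can miss mismatches further down, so a programmatic check may be worthwhile. The sketch below uses two short hypothetical vectors; with the real objects one would presumably compare tree$tip.label against sample_data$Sample_ID in the same way. The lower-case ID is included deliberately to show how a formatting difference surfaces.

```r
# toy vectors standing in for tree$tip.label and sample_data$Sample_ID
tip_labels <- c("SRR5006072", "SRR4192106", "S18BD07865")
sample_ids <- c("S18BD07865", "SRR5006072", "srr4192106")  # note the lower case

# tip labels with no matching row in the sample data (order does not matter)
missing_ids <- setdiff(tip_labels, sample_ids)
missing_ids

# after harmonising the case, every tip label finds a match
missing_after_fix <- setdiff(tip_labels, toupper(sample_ids))
length(missing_after_fix) == 0
```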
We are ready to go!
ggtree() offers many different layout formats and some may be more suitable for your specific purpose than others:
# Examples:
ggtree(tree) # most simple linear tree
ggtree(tree, branch.length = "none") # most simple linear tree with all tips aligned
ggtree(tree, layout="circular") # most simple circular tree
ggtree(tree, layout="circular", branch.length = "none") # most simple circular tree with all tips aligned
# for other options see online: http://yulab-smu.top/treedata-book/chapter4.html
The simplest annotation of your tree is the addition of the sample names at the tips, as well as coloring of the tip points and, if desired, the branches:
# A: Plot Circular tree:
ggtree(tree, layout="circular", branch.length='none') %<+% sample_data + # the %<+% is used to add your dataframe with sample data to the tree
aes(color=I(Source))+ # color the branches according to a variable in your dataframe
scale_color_manual(name = "Sample Origin", # name of your color scheme (will show up in the legend like this)
breaks = c("NRC BEL", "NA"), # the different options in your variable
labels = c("NRCSS Belgium", ""), # how you want the different options named in your legend, allows for formatting
values= c("blue"), # the color you want to assign to the variable if it is "NRC BEL"
na.value="grey")+ # for the NA values we choose the color grey
new_scale_color()+ # allows to add an additional color scheme for another variable
geom_tippoint(aes(color=Continent), size=1.5)+ # color the tip point by continent, you may change shape adding "shape = "
scale_color_brewer(name = "Continent", # name of your color scheme (will show up in the legend like this)
palette="Set1", # we choose a premade set of colors coming with the brewer package
na.value="grey")+ # for the NA values we choose the color grey
geom_tiplab(color='black', offset = 1, size = 1, geom = "text" , align=TRUE)+ # add the name of the sample to the tip of its branch (you can add as many text lines as you like with the + , you just need to change the offset value to place them next to each other)
ggtitle("Phylogenetic tree of Shigella sonnei")+ # title of your graph
theme(axis.title.x=element_blank(), # removes x-axis title
axis.title.y=element_blank(), # removes y-axis title
legend.title=element_text(face="bold", size =12), # defines font size and format of the legend title
legend.text=element_text(face="bold", size =10), # defines font size and format of the legend text
plot.title = element_text(size =12, face="bold"), # defines font size and format of the plot title
legend.position="bottom", # defines placement of the legend
legend.box="vertical", legend.margin=margin()) # defines placement of the legend
## Warning: Duplicated aesthetics after name standardisation: size
## Warning: Duplicated aesthetics after name standardisation: size
# Export your tree graph:
ggsave(here::here("example_tree_circular_1.png"), width = 12, height = 14)
Sometimes you may have a very large phylogenetic tree and be interested in only one part of it, for example if you produced a tree including historical or international samples to get an overview of where your dataset fits into the bigger picture. To look closer at your own data, you then want to inspect only that portion of the bigger tree.
Since the phylogenetic tree file is just the output of sequencing data analysis, we cannot manipulate the order of the nodes and branches in the file itself. These have already been determined in previous analysis from the raw NGS data. We are, however, able to zoom into parts, hide parts and even subset part of the tree.
If you don’t want to “cut” your tree, but only inspect part of it more closely, you can zoom in to view a specific part:
# First we plot the whole tree:
p <- ggtree(tree) %<+% sample_data +
geom_tiplab(size =1.5) + # labels the tips of all branches with the sample name in the tree file
geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust=1, vjust =1) # labels all the nodes in the tree
p
We want to zoom into the branch which is sticking out, after node number 452, to get a closer look:
viewClade(p , node=452)
Conversely, we may want to ignore this branch which is sticking out. We can do so by collapsing it at the node (indicated here by the blue square):
#First we collapse at node 452
p_collapsed <- collapse(p, node=452)
#To not forget that we collapsed this node we assign a symbol to it:
p_collapsed + geom_point2(aes(subset=(node == 452)), size=5, shape=23, fill="steelblue")
If we want to make a more permanent change and create a new tree to work with, we can subset part of it and even save it as a new Newick tree file.
# To do so you can add the node and tip labels to your tree to see which part you want to subset:
ggtree(tree, branch.length='none', layout='circular') %<+% sample_data +
geom_tiplab(size =1) + # labels the tips of all branches with the sample name in the tree file
geom_text2(aes(subset=!isTip, label=node), size =3, color = "darkred") +# labels all the nodes in the tree
theme(legend.position = "none", # removes the legend all together
axis.title.x=element_blank(),
axis.title.y=element_blank(),
plot.title = element_text(size =12, face="bold"))
# A: Subset tree based on node:
sub_tree1 <- tree_subset(tree, node = 528) # we subset the tree at node 528
# lets have a look at the subset tree:
ggtree(sub_tree1)+ geom_tiplab(size =3) +
ggtitle("Subset tree 1")
# B: Subset the same part of the tree based on a sample, in this case S17BD07692:
sub_tree2 <- tree_subset(tree,"S17BD07692", levels_back = 9) # levels back defines how many nodes backwards from the sample tip you want to go
# lets have a look at the subset tree:
ggtree(sub_tree2)+ geom_tiplab(size =3) +
ggtitle("Subset tree 2")
You can also save your new tree as a Newick file:
ape::write.tree(sub_tree2, file='Shigelle_subtree_2.nwk')
As mentioned before, we cannot change the order of tips or nodes in the tree, as this is based on their genetic relatedness and is not subject to visual manipulation. But we can rotate branches around nodes if that eases our visualization.
First we plot our new subsetted tree with node labels to choose the node we want to manipulate:
p <- ggtree(sub_tree2) + geom_tiplab(size =4) +
geom_text2(aes(subset=!isTip, label=node), size =5, color = "darkred", hjust =1, vjust =1) # labels all the nodes in the tree
p
We choose to manipulate node number 39: we do so by applying ggtree::rotate() or ggtree::flip() indirectly to node 36 so node 39 moves to the bottom and nodes 37 and 38 move to the top:
#
# p1 <- p + geom_hilight(39, "steelblue", extend =0.0015)+ # highlights the node 39 in blue
# geom_hilight(37, "yellow", extend =0.0015) + # highlights the node 37 in yellow
# ggtitle("Original tree")
#
# # we want to rotate node 36 so node 39 is on the bottom and nodes 37 and 38 move to the top:
#
# rotate(p1, 39) %>% rotate(37)+
# ggtitle("Rotated Node 36")
#
# #or we can use the flip command to achieve the same thing:
# flip(p1, 39, 37)
Let's say we are investigating the cluster of cases with clonal expansion which occurred in 2017 and 2018 at node 39 in our sub-tree. We add the year of strain isolation as well as travel history and color by country to see the origin of other closely related strains:
# Add sample data:
ggtree(sub_tree2) %<+% sample_data +
geom_tiplab(size =2.5, offset = 0.001, align = TRUE) + # labels the tips of all branches with the sample name in the tree file
theme_tree2()+
xlab("genetic distance")+ # add a label to the x-axis
xlim(0, 0.015)+ # set the x-axis limits of our tree
geom_tippoint(aes(color=Country), size=1.5)+ # color the tip point by country
scale_color_brewer(name = "Country",
palette="Set1",
na.value="grey")+
geom_tiplab(aes(label = Year), color='blue', offset = 0.0045, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add isolation year
geom_tiplab(aes(label = Travel_history), color='red', offset = 0.006, size = 3, linetype = "blank" , geom = "text" , align=TRUE)+ # add travel history
ggtitle("Phylogenetic tree of Belgian S. sonnei strains with travel history")+ # add plot title
theme(axis.title.x=element_blank(),
axis.title.y=element_blank(),
legend.title=element_text(face="bold", size =12),
legend.text=element_text(face="bold", size =10),
plot.title = element_text(size =12, face="bold"))
## Warning: Duplicated aesthetics after name standardisation: size
## Warning: Duplicated aesthetics after name standardisation: size
## Warning: Duplicated aesthetics after name standardisation: size
Our observation points towards an import of strains from Asia, which then circulated in Belgium over the years and seems to have caused our latest outbreak.
We can add more complex information, such as categorical presence of antimicrobial resistance genes and numeric values for actually measured resistance to antimicrobials in form of a heatmap using the ggtree::gheatmap() function.
First we need to plot our tree (this can be either linear or circular). We will use the sub_tree2 object created in the subsetting step above.
# A: Circular tree:
p <- ggtree(sub_tree2, branch.length='none', layout='circular') %<+% sample_data +
geom_tiplab(size =3) +
theme(legend.position = "none",
axis.title.x=element_blank(),
axis.title.y=element_blank(),
plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))
p
Second, we prepare our data. To visualize different variables with new color schemes, we subset our dataframe to the desired variable.
For example we want to look at gender and mutations that could confer resistance to ciprofloxacin:
# Create your gender dataframe:
gender <- data.frame("gender" = sample_data[,c("Gender")])
# It's important to add the Sample_ID as rownames, otherwise it cannot match the data to the tree tip.labels:
rownames(gender) <- sample_data$Sample_ID
# Create your ciprofloxacin dataframe based on mutations in the gyrA gene:
cipR <- data.frame("cipR" = sample_data[,c("gyrA_mutations")])
rownames(cipR) <- sample_data$Sample_ID
# Create your ciprofloxacin dataframe based on the measured minimum inhibitory concentration (MIC) from the laboratory:
MIC_Cip <- data.frame("mic_cip" = sample_data[,c("MIC_CIP")])
rownames(MIC_Cip) <- sample_data$Sample_ID
We create a first plot adding a binary heatmap for gender to the phylogenetic tree:
# First we add gender:
h1 <- gheatmap(p, gender, offset = 10, width=0.10, color=NULL, # offset shifts the heatmap to the right, width defines the width of the heatmap column, color defines the border of the heatmap columns
colnames = FALSE)+ # hides column names for the heatmap
scale_fill_manual(name = "Gender", # define the coloring scheme and legend for gender
values = c("#00d1b1", "purple"),
breaks = c("Male", "Female"),
labels = c("Male", "Female"))+
theme(legend.position="bottom",
legend.title = element_text(size=12),
legend.text = element_text(size =10),
legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h1
Then we add information on ciprofloxacin resistance genes:
# First we assign a new color scheme to our existing plot; this enables us to define and change the colors for our second variable
h2 <- h1 + new_scale_fill()
# then we combine these into a new plot:
h3 <- gheatmap(h2, cipR, offset = 12, width=0.10, # adds the second row of heatmap describing ciprofloxacin resistance genes
colnames = FALSE)+
scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
values = c("#fe9698","#ea0c92"),
breaks = c( "gyrA D87Y", "gyrA S83L"),
labels = c( "gyrA d87y", "gyrA s83l"))+
theme(legend.position="bottom",
legend.title = element_text(size=12),
legend.text = element_text(size =10),
legend.box="vertical", legend.margin=margin())+
guides(fill=guide_legend(nrow=2,byrow=TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h3
Next we add continuous data on actual resistance determined by the laboratory as the minimum inhibitory concentration (MIC) of ciprofloxacin:
# First we add the new coloring scheme:
h4 <- h3 + new_scale_fill()
# then we combine the two into a new plot:
h5 <- gheatmap(h4, MIC_Cip, offset = 14, width=0.10,
colnames = FALSE)+
scale_fill_continuous(name = "MIC for ciprofloxacin",
low = "yellow", high = "red",
breaks = c(0, 0.50, 1.00),
na.value = "white")+
guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
theme(legend.position="bottom",
legend.title = element_text(size=12),
legend.text = element_text(size =10),
legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5
We can do the same exercise for a linear tree:
# B: Linear tree:
p <- ggtree(sub_tree2) %<+% sample_data +
geom_tiplab(size =3) + # labels the tips
theme_tree2()+
xlab("genetic distance")+
xlim(0, 0.015)+
theme(legend.position = "none",
axis.title.y=element_blank(),
plot.title = element_text(size =12, face="bold",hjust = 0.5, vjust = -15))
# First we add gender:
h1 <- gheatmap(p, gender, offset = 0.003, width=0.1, color="black",
colnames = FALSE)+
scale_fill_manual(name = "Gender",
values = c("#00d1b1", "purple"),
breaks = c("Male", "Female"),
labels = c("Male", "Female"))+
theme(legend.position="bottom",
legend.title = element_text(size=12),
legend.text = element_text(size =10),
legend.box="vertical", legend.margin=margin())
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
# h1
# Then we add ciprofloxacin after adding another colorscheme layer:
h2 <- h1 + new_scale_fill()
h3 <- gheatmap(h2, cipR, offset = 0.004, width=0.1,color="black",
colnames = FALSE)+
scale_fill_manual(name = "Ciprofloxacin resistance \n conferring mutation",
values = c("#fe9698","#ea0c92"),
breaks = c( "gyrA D87Y", "gyrA S83L"),
labels = c( "gyrA d87y", "gyrA s83l"))+
theme(legend.position="bottom",
legend.title = element_text(size=12),
legend.text = element_text(size =10),
legend.box="vertical", legend.margin=margin())+
guides(fill=guide_legend(nrow=2,byrow=TRUE))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
# h3
# Then we add the minimum inhibitory concentration determined by the lab (MIC):
h4 <- h3 + new_scale_fill()
h5 <- gheatmap(h4, MIC_Cip, offset = 0.005, width=0.1, color="black",
colnames = FALSE)+
scale_fill_continuous(name = "MIC for ciprofloxacin",
low = "yellow", high = "red",
breaks = c(0,0.50,1.00),
na.value = "white")+
guides(fill = guide_colourbar(barwidth = 5, barheight = 1))+
theme(legend.position="bottom",
legend.title = element_text(size=10),
legend.text = element_text(size =8),
legend.box="horizontal", legend.margin=margin())+
guides(shape = guide_legend(override.aes = list(size = 2)))
## Scale for 'y' is already present. Adding another scale for 'y', which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill', which will replace the existing scale.
h5
http://hydrodictyon.eeb.uconn.edu/eebedia/index.php/Ggtree# Clade_Colors
https://bioconductor.riken.jp/packages/3.2/bioc/vignettes/ggtree/inst/doc/treeManipulation.html
https://guangchuangyu.github.io/ggtree-book/chapter-ggtree.html
https://bioconductor.riken.jp/packages/3.8/bioc/vignettes/ggtree/inst/doc/treeManipulation.html
Data visualisations increasingly need to be interrogable by their audience, so interactive plots are becoming common. There are several ways to create them, but the two most important are {plotly} and {shiny}.
{shiny} is covered in another part of this handbook, so we will only cover {plotly} here.
Making plots interactive can sound more difficult than it turns out to be, thanks to some fantastic tools.
In this section, you’ll learn to easily make a plot interactive with the wonders of {ggplot2} and {plotly}.
## Warning: Removed 3 rows containing missing values (position_stack).
In the example you saw a very basic epicurve that had been transformed to be interactive using the fantastic {ggplot2} - {plotly} integration. So to start, make a basic chart of your own:
Loading data
linelist <- rio::import("linelist_cleaned.xlsx")
Manipulate and add columns (best taught in the epicurves section)
linelist <- linelist %>%
dplyr::mutate(
## If the outcome column is NA, change to "Unknown"
outcome = dplyr::if_else(condition = is.na(outcome),
true = "Unknown",
false = outcome),
## If the date of infection is NA, use the date of onset instead
date_earliest = dplyr::if_else(condition = is.na(date_infection),
true = date_onset,
false = date_infection),
## Summarise earliest date to earliest week
week_earliest = lubridate::floor_date(x = date_earliest,
unit = "week",
week_start = 1)
)
Count for plotting
## Find number of cases in each week by their outcome
linelist <- linelist %>%
dplyr::count(week_earliest, outcome)
Make into a plot
p <- linelist %>%
ggplot()+
geom_col(aes(week_earliest, n, fill = outcome))+
xlab("Week of infection/onset") + ylab("Cases per week")+
theme_minimal()
Make interactive
p <- p %>%
plotly::ggplotly()
Voila!
p
## Warning: Removed 3 rows containing missing values (position_stack).
When exporting in an R Markdown generated HTML (like this book!) you want to make the plot's file size as small as possible (with no negative side effects in most cases). For this, just add this line:
p <- p %>%
plotly::partial_bundle()
Some of the buttons on a standard plotly (as shown on the preparation tab) are superfluous and can be distracting, so it’s best to remove them. You can do this simply by piping the output into plotly::config.
## these buttons are superfluous/distracting
plotly_buttons_remove <- list('zoom2d','pan2d','lasso2d', 'select2d','zoomIn2d',
'zoomOut2d','autoScale2d','hoverClosestCartesian',
'toggleSpikelines','hoverCompareCartesian')
p <- p %>%
plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)
Earlier you saw how to make heatmaps, and they are just as easy to make interactive.
## `summarise()` has grouped output by 'location_name'. You can override using the `.groups` argument.
## Joining, by = c("location_name", "week")
metrics_plot %>%
ggplotly() %>%
partial_bundle() %>%
config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)
You can even make interactive maps! However, they’re slightly trickier. Although {plotly} works well with ggplot2::geom_sf in RStudio, when you try to include its outputs in R Markdown HTML files (like this book), it doesn’t work well.
So instead you can use {plotly}’s own mapping tools, which can be tricky but are easy when you know how. Read on…
We’re going to use Covid-19 incidence across African countries for this example. The data used can be found on the World Health Organisation website.
You’ll also need a new type of file, a GeoJSON, which is sort of similar to a shp file for those familiar with GIS. For this book, we used one from here.
GeoJSON files are stored in R as complex lists and you’ll need to manipulate them a little.
## You need two new packages: {rjson} and {purrr}
pacman::p_load(plotly, rjson, purrr)
## This is a simplified version of the WHO data
df <- rio::import(here::here("data", "covid_incidence.csv"))
## Load your geojson file
geoJSON <- rjson::fromJSON(file=here::here("data", "africa_countries.geo.json"))
## Here are some of the properties for each element of the object
head(geoJSON$features[[1]]$properties)
## $scalerank
## [1] 1
##
## $featurecla
## [1] "Admin-0 country"
##
## $labelrank
## [1] 6
##
## $sovereignt
## [1] "Burundi"
##
## $sov_a3
## [1] "BDI"
##
## $adm0_dif
## [1] 0
This is the tricky part. For {plotly} to match your incidence data to GeoJSON, the countries in the geoJSON need an id in a specific place in the list of lists. For this we need to build a basic function:
## The property column we need to choose here is "sovereignt" as it is the names for each country
give_id <- function(x){
x$id <- x$properties$sovereignt ## Take sovereignt from properties and set it as the id
return(x)
}
## Use {purrr} to apply this function to every element of the features list of the geoJSON object
geoJSON$features <- purrr::map(.x = geoJSON$features, give_id)
plotly::plot_ly() %>%
plotly::add_trace( #The main plot mapping function
type="choropleth",
geojson=geoJSON,
locations=df$Name, #The column with the names (must match id)
z=df$Cumulative_incidence, #The column with the incidence values
zmin=0,
zmax=57008,
colorscale="Viridis",
marker=list(line=list(width=0))
) %>%
plotly::colorbar(title = "Cases per million") %>%
plotly::layout(title = "Covid-19 cumulative incidence",
geo = list(scope = 'africa')) %>%
plotly::config(displaylogo = FALSE, modeBarButtonsToRemove = plotly_buttons_remove)
Plotly is not just for R, but also works well with Python (and really any data science language, as it’s built in JavaScript). You can read more about it on the plotly website.
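The id-assignment step used for the map can be checked on a small scale before plotting. The sketch below is only an illustration: it uses a made-up two-country list in place of the real GeoJSON, made-up incidence values, and base R's lapply() in place of purrr::map(). It then lists any country names in the data that have no matching id, since those would have no shape to be drawn on.

```r
# Toy stand-in for the parsed GeoJSON: a list with a `features` element,
# each feature holding a `properties` list (contents are made up)
toy_geo <- list(features = list(
  list(properties = list(sovereignt = "Burundi")),
  list(properties = list(sovereignt = "Kenya"))
))

# Same idea as give_id() above: copy the country name to a top-level `id`
give_id <- function(x) {
  x$id <- x$properties$sovereignt
  x
}
toy_geo$features <- lapply(toy_geo$features, give_id)

# Collect the ids and compare them with the names in the incidence data
ids <- vapply(toy_geo$features, function(x) x$id, character(1))
toy_df <- data.frame(Name = c("Burundi", "Kenya", "Cabo Verde"),
                     Cumulative_incidence = c(10, 20, 30))  # made-up values

unmatched <- setdiff(toy_df$Name, ids)  # names with no matching id
unmatched
```

Running the same setdiff() on the real data is a quick way to spot spelling differences between the two sources.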
This page lists common errors and suggests solutions for troubleshooting them
No such file or directory:
If you see an error like this when you try to export or import: Check the spelling of the file and filepath, and if the path contains slashes make sure they are forward / and not backward \. Also make sure you used the correct file extension (e.g. .csv, .xlsx).
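A few base R calls can help narrow down this kind of error. This is a minimal sketch and the path is hypothetical; substitute your own folder and file name.

```r
# Hypothetical path, built with file.path() so the separators are forward slashes
path <- file.path("data", "linelist_cleaned.xlsx")

file.exists(path)          # FALSE means R cannot see the file from here
getwd()                    # the folder that relative paths start from
tools::file_ext(path)      # the file extension as R reads it
list.files(dirname(path))  # what is actually in that folder (empty if it does not exist)
```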
#Tried to add a value ("Missing") to a factor (with replace_na operating on a factor)
Problem with `mutate()` input `age_cat`.
i invalid factor level, NA generated
i Input `age_cat` is `replace_na(age_cat, "Missing")`.
invalid factor level, NA generated
You likely have a column of class Factor (which contains pre-defined levels) and tried to add a new value to it. Convert it to class Character before adding a new value.
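As a sketch of the fix, here are two base R ways to add "Missing" to a factor. The age categories are made up for illustration.

```r
age_cat <- factor(c("0-4", "5-14", NA, "0-4"))  # made-up example values

# Option 1: convert to character, fill in the value, then convert back
age_chr <- as.character(age_cat)
age_chr[is.na(age_chr)] <- "Missing"
age_fixed <- factor(age_chr)

# Option 2: add the new level first, then assign it
age_cat2 <- age_cat
levels(age_cat2) <- c(levels(age_cat2), "Missing")
age_cat2[is.na(age_cat2)] <- "Missing"

levels(age_fixed)
```

Both routes give the same values; the second keeps the original level order and places "Missing" last.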
Error in select(data, var) : unused argument (var)
You think you are using dplyr::select() but the select() function has been masked by MASS::select() - specify dplyr:: or re-order your package loading so that dplyr is after all the others.
Other common masking errors stem from: plyr::summarise() and stats::filter(). Consider using the conflicted package.
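To see which package a bare function name currently resolves to, base R's find() can help. A minimal sketch, using filter() as the example because {stats} is attached in a default session:

```r
# Which attached packages define something called "filter"?
# Results follow the search path, so the first entry is the one a bare
# filter() call would use.
find("filter")

# Calling with the package prefix removes the ambiguity
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))  # 3-point moving average from {stats}

# Confirm which package a prefixed function belongs to
environmentName(environment(stats::filter))
```

If {dplyr} is loaded after {stats}, "package:dplyr" would appear first in the find() output.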
# Ran recode() without re-stating the column inside it. The correct form is mutate(x = recode(x, OLD = NEW))
Error: Problem with `mutate()` input `hospital`.
x argument ".x" is missing, with no default
i Input `hospital` is `recode(...)`.
Error: Insufficient values in manual scale. 3 needed but only 2 provided.
In ggplot(), scale_fill_manual() was given values = c(“orange”, “purple”), which is too few colours for the number of factor levels. Consider whether NA has now become a factor level.
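One way to check is to count the levels before building the scale. This sketch uses made-up outcome values in which NA has been converted to an explicit "Unknown" category, which then needs its own colour.

```r
outcome <- c("Death", "Recover", NA, "Death")  # made-up example values
outcome[is.na(outcome)] <- "Unknown"           # NA is now an ordinary value
outcome <- factor(outcome)

n_needed <- nlevels(outcome)  # number of colours the manual scale must supply
n_needed

fill_values <- c("orange", "purple")
length(fill_values) >= n_needed  # FALSE: one colour short

# A named vector with one colour per level avoids the mismatch
fill_values <- c(Death = "orange", Recover = "purple", Unknown = "grey70")
```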
Error: unexpected symbol in:
" geom_histogram(stat = "identity")+
tidyquant::geom_ma(n=7, size = 2, color = "red" lty"
If you see “unexpected symbol”, check for missing commas.
Also consider whether you re-arranged dplyr verbs and didn’t replace a pipe in the middle, or didn’t remove a pipe from the end.
“Can’t add x object …”: you have a + at the end of a ggplot command that you need to delete.
rprofiles
Keep the title of this section as “Overview”.
This tab should include:
Keep the title of this section as “Preparation”.
Data preparation steps such as:
Can be used to separate major steps of data preparation. Re-name as needed
Can be used to separate major steps of data preparation. Re-name as needed.
This tab can be renamed. This tab should demonstrate execution of the task using recommended package/approach. For example, using a package customized for this task where the execution is simple and fast but perhaps less customizable. For example using incidence package to create an epicurve.
Sub-tabs if necessary. Re-name as needed.
This tab can be re-named. This tab should demonstrate execution of the task a more standard/core package (e.g. ggplot2, or base R) that allows for more flexibility in the output or more package stability. For example, showing how to create an epicurve using ggplot2.
Sub-tabs if necessary. Re-name as needed.
This tab should stay with the name “Resources”. Links to other online tutorials or resources.
The Page title should be succinct. Consider adding a tag with no spaces into the curly brackets, such as below. This can be used for internal links within the handbook.
{#title_tag }
R Markdown is a fantastic tool for creating automated, reproducible, and share-worthy outputs. It can generate static or interactive outputs, in the form of HTML, Word, PDF, PowerPoint, and others.
Using R Markdown allows you to easily recreate an entire formatted document, including tables, figures, and text, using new data (e.g. daily surveillance reports) and/or subsets of data (e.g. reports for specific geographies).
This guide will go through the basics. See ‘resources’ tab for further info.
Background to Markdown
To explain some of the concepts and packages involved:
The R Studio website describes how these all link in together (https://rmarkdown.rstudio.com/authoring_quick_tour.html):
Creating documents with R Markdown starts with an .Rmd file that contains a combination of markdown (content with simple text formatting) and R code chunks. The .Rmd file is fed to knitr, which executes all of the R code chunks and creates a new markdown (.md) document which includes the R code and its output.
The markdown file generated by knitr is then processed by pandoc which is responsible for creating a finished web page, PDF, MS Word document, slide show, handout, book, dashboard, package vignette or other format.
This may sound complicated, but R Markdown makes it extremely simple by encapsulating all of the above processing into a single render function. Better still, RStudio includes a “Knit” button that enables you to render an .Rmd and preview it using a single click or keyboard shortcut.
Installation
To create R Markdown, you need to have the following installed:
install.packages('rmarkdown')
install.packages('tinytex')
tinytex::install_tinytex() # install TinyTeX
Workflow
Preparation of an R Markdown workflow involves ensuring you have set up an R project and have a folder structure that suits the desired workflow.
For instance, you may want an ‘output’ folder for your rendered documents, an ‘input’ folder for new cleaned data files, as well as subfolders within them which are date-stamped or reflect the subgeographies of interest. The markdown itself can go in a ‘rmd’ subfolder, particularly if you have multiple Rmd files within the same project.
You can set code up to create output subfolders for you each time you run reports (see “Producing an output”), but you should have the overall design in mind.
Because R Markdown can run into pandoc issues when running on a shared network drive, it is recommended that your folder is on your local machine, e.g. in a project within ‘My Documents’. If you use Git (much recommended!), this will be familiar.
An R Markdown document looks like and can be edited just like a standard R script, in R Studio. However, it contains more than just the usual R code and hashed comments. There are three basic components:
1. Metadata: This is referred to as the ‘YAML metadata’ and is at the top of the R Markdown document between two ‘---’ lines. It tells your Rmd file what type of output to produce, formatting preferences, and other metadata such as document title, author, and date. There are other uses not mentioned here (but referred to in ‘Producing an output’). Note that indentation matters.
2. Text: This is the narrative of your document, including the titles. It is written in the markdown language, used across many different programmes. This means you can add basic formatting, for instance:
_text_ or *text* to italicise
**text** for bold text
# at the start of a new line for a title (and ## for a second-level title, ### for a third-level title, etc.)
* at the start of a new line for bullet points
`text` to display text as code (as above)
The actual appearance of the font can be set by using specific templates (specified in the YAML metadata; see example tabs).
You can also include minimal R code within backticks, for within-text values. See example below.
3. Code chunks: This is where the R code goes, for the actual data management and visualisation. To note: These ‘chunks’ will appear to have a slightly different background colour from the narrative part of the document.
Each chunk always starts with three backticks and chunk information within squiggly brackets, and ends with three more backticks.
Some notes about the content of the squiggly brackets:
There are also two arrows at the top right of each chunk, which are useful to run code within a chunk, or all code in prior chunks.
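To make the three components concrete, the sketch below writes a very small .Rmd file from R. It is only an illustration: the title, chunk name, and file name are hypothetical, and the built-in cars dataset stands in for real data. The file is written to a temporary folder and is not rendered, because rendering needs {rmarkdown} and pandoc.

```r
fence <- strrep("`", 3)  # three backticks, which open and close a code chunk

rmd_lines <- c(
  # 1. YAML metadata between two lines of three dashes
  "---",
  "title: \"Surveillance report\"",
  "output: html_document",
  "---",
  "",
  # 2. Narrative text written in markdown
  "## Summary",
  "",
  "This is *narrative* text written in markdown.",
  "",
  # 3. A code chunk: chunk information in curly brackets, then R code
  paste0(fence, "{r example-chunk, echo=FALSE}"),
  "summary(cars)",
  fence
)

rmd_path <- file.path(tempdir(), "minimal_report.Rmd")
writeLines(rmd_lines, rmd_path)

# rmarkdown::render(rmd_path)  # uncomment to knit (needs {rmarkdown} and pandoc)
```

Opening the resulting file in RStudio shows the same layout you would normally type by hand.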
Producing an output
General notes
Everything used by this markdown must be referenced within the Rmd file. For instance, you need to load any required packages or data.
A single or test run from within R Markdown
To render a single document, for instance if you are testing it or if you only need to produce one rendered document at a time, you can do it from within the open R Markdown file. Click the “Knit” button at the top of the document.
The ‘R Markdown’ tab will start processing to show you the overall progress, and the finished document will open automatically when complete. This document will also be saved in the same folder as your markdown, with the same file name aside from the file extension. This is not ideal for version control, as you will then have to rename the file yourself.
A single run from a separate script
To run the markdown so that a date-stamped file is produced, you can create a separate script and call the Rmd file from within it. You can also specify the folder and file name, and include a dynamic date and time, so that file will be date stamped on production.
rmarkdown::render("rmd_reports/create_RED_report.Rmd",
output_file = paste0("outputs/Report_", Sys.Date(), ".docx")) # Use 'paste0' to combine text and code for a dynamic file name
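If you also want the time of day in the file name, one option is to format Sys.time(). This is a small sketch; the "Report_" prefix and the stamp layout are just examples.

```r
# Date and time stamp such as "2020-12-21_1430" (year-month-day_hourminute)
stamp <- format(Sys.time(), "%Y-%m-%d_%H%M")

# Combine fixed text with the stamp for a dynamic file name
output_name <- paste0("Report_", stamp, ".docx")
output_name
```

The resulting string can be used wherever the fixed file name would otherwise go.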
Routine runs into newly created date-stamped sub folders
Add a couple of lines of code to define the date you are running the report (e.g. using Sys.Date() as in the example above) and create new sub folders. If you want the date to reflect a specific date rather than the current date, you can also enter it as an object.
# Set the date of report
refdate <- as.Date("2020-12-21")
# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)
# Render the report
rmarkdown::render("rmd_reports/create_report.Rmd",
output_file = paste0(outputfolder, "/Report_", refdate, ".docx")) # Dynamic folder name now included
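The folder-creation step can be made safe to re-run by checking whether the folder already exists. The sketch below works in a temporary directory so it can be tried without touching a real project; in practice the base folder would be your own ‘outputs’ folder.

```r
refdate <- as.Date("2020-12-21")

# Temporary location used for illustration only
base_dir <- file.path(tempdir(), "outputs")
outputfolder <- file.path(base_dir, format(refdate))

# recursive = TRUE also creates 'outputs' itself if it is missing
if (!dir.exists(outputfolder)) {
  dir.create(outputfolder, recursive = TRUE)
}

dir.exists(outputfolder)
```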
You may want some dynamic information to be included in the markdown itself. This is addressed in the next section.
Parameterised reports are the next step so that the content of the R Markdown itself can also be dynamic. For example, the title can change according to the subgeography you are running, and the data can filter to that subgeography of interest.
Let’s say you want to run the markdown to produce a report with surveillance data for Area1 and Area2. You will:
filter(area == params$areanumber) rather than filter(area=="Area1").
For instance (simplified version which does not include setup code such as library/data loading):
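The idea can be tried in a plain R session by standing in for the params object with an ordinary list. This is only a sketch: in a knitted document params is created for you from the YAML, and the data frame and its values here are made up.

```r
# Stand-in for the `params` list that R Markdown builds from the YAML header
params <- list(areanumber = "Area1")

# Made-up surveillance data
surv <- data.frame(area  = c("Area1", "Area2", "Area1"),
                   cases = c(5, 3, 8))

# Filter on the parameter instead of a hard-coded area name
area_data <- surv[surv$area == params$areanumber, ]

# The same parameter can feed into text such as a title
report_title <- paste("Surveillance report for", params$areanumber)
report_title
```

Changing the single value in params is then enough to switch the whole report to another area.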
You can change the content by editing the YAML as needed, or set up a loop in a separate script to iterate through the areas. As with the previous section, you can set up the folders as well.
As you can see below, you set up a list which includes all areas of interest (arealist), and when rendering the markdown you specify that the parameterized areanumber for a specific iteration is the Nth value of the arealist. For instance, for the first iteration, areanumber will equate to “Area1”. The code below also specifies that the Nth area name will be included in the output file name.
Note that this will work even if an area or date are specified within the YAML itself - that YAML information will get overwritten by the loop.
# Set the date of report
refdate <- as.Date("2020-12-21")
# Set the list (note that this can also be an imported list)
arealist <- c("Area1", "Area2", "Area3", "Area4", "Area5")
# Create the folders
outputfolder <- paste0("outputs/", refdate) # This is the new folder name
dir.create(outputfolder) # Creates the folder (in this case assumed 'outputs' already exists)
#Run the loop
for (i in seq_along(arealist)) { # Loops from the first value to the last value in 'arealist'
  rmarkdown::render(here::here("rmd_reports/create_report.Rmd"),
                    params = list(areanumber = arealist[i], # Assigns the i-th value of arealist to the current areanumber
                                  refdate = refdate),
                    output_file = paste0(outputfolder, "/Report_", arealist[i], "_", refdate, ".docx"))
}
Further information can be found via:
A good explainer of markdown vs knitr vs Rmarkdown is here: https://stackoverflow.com/questions/40563479/relationship-between-r-markdown-knitr-pandoc-and-bookdown
Here is an online guide to using Github and R. Some of the below text is adapted from this guide.
Github is a website that supports collaborative projects with version control. In a nutshell, the project’s files exist in the Github repository as a “master” version (called a “branch”). If you want to make a change to those files you must create a different branch (version) to build and test the changes in. Master remains unaffected by your changes until your branch is merged (after some verification steps) into the master branch. A “commit” is the saving of a smaller group of changes you make within your branch. A Pull Request is your request to merge your changes into the master branch.
The way RStudio and Github interact is as follows:
Epi_R_handbook R project that lives on Github website repository - master and other branches all exist and are viewable on this Github repository. Pull requests, issue tracking, and de-conflicting merges happens online here.
Image source
Epi_R_handbook
Epi_R_handbook repository on Github
In your RStudio you will now have a Git tab in the same pane as your R Environment:
Please note the buttons circled as they will be referenced later (from left to right):
Note: GitHub now uses “main” rather than “master” as the default branch name for new repositories, as “master” is an unnecessary reference to slavery. Older repositories may still use “master”.
Once you are done with your commits and have pushed everything up to the remote Github repository, you may want to request that your branch be merged with the master branch.
Go to the repository on Github and click the button to view all the branches (next to the drop-down to select branches). Now find your branch and click the trash icon next to it. Read more here
Be sure to also delete the branch locally on your computer:
TEST IT
You can test your ability to make changes, commits, pull requests, etc. by modifying this R script which is saved to the main Rproject folder: test_your_abilities.R
Asked to provide your password too often?
Instructions for connecting to the repository via a SSH key (more complicated):
See chapters 10 and 11 of this tutorial
Using R on network or “company” shared drives can be extremely frustrating. This page contains approaches, common errors, and suggestions on troubleshooting, including for the particularly delicate situations involving Rmarkdown.
Using R on Network Drives: Overarching principles
Useful commands
# Find libraries
.libPaths() # Your library paths, listed in order that R installs/searches.
# Note: all libraries will be listed, but to install to some (e.g. C:) you
# may need to be running RStudio as an administrator (it won't appear in the
# install packages library drop-down menu)
# Switch order of libraries
# this can affect the priority of R finding a package. E.g. you may want your C: library to be listed first
myPaths <- .libPaths() # get the paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them
# Find Pandoc
Sys.getenv("RSTUDIO_PANDOC") # Find where RStudio thinks your Pandoc installation is
# Find a package
# gives first location of package (note order of your libraries)
find.package("rmarkdown", lib.loc = NULL, quiet = FALSE, verbose = getOption("verbose"))
“Failed to compile…tex in rmarkdown”
Check or install tinytex, to a C: location:
# check/install tinytex, to C: location
tinytex::install_tinytex()
tinytex:::is_tinytex() # should return TRUE (note three colons)
Internet routines cannot be loaded
For example, “Error in tools::startDynamicHelp() : internet routines cannot be loaded”
C: library does not appear as an option when I try to install packages manually
Pandoc 1 error
If you are getting pandoc error 1 when knitting Rmarkdowns on network drives:
myPaths <- .libPaths() # get the library paths
myPaths <- c(myPaths[2], myPaths[1]) # switch them
.libPaths(myPaths) # reassign them
Pandoc Error 83 (can’t find file…rmarkdown…lua…)
This means that it was unable to find this file.
Possibilities:
R is not able to find the ‘rmarkdown’ package file, so check which library the rmarkdown package lives in.
If it is in a library that is inaccessible (e.g. its path starts with “\\”), consider manually moving it to C: or another named drive library.
But be aware that the rmarkdown package has to be able to reach tinytex, so the rmarkdown package can’t live on a network drive.
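A quick way to see which of your libraries sit on a network path is to test how each path begins. The helper below is a hypothetical sketch: it only flags paths that start with two backslashes or two forward slashes, so a network drive mapped to a letter (e.g. H:) would not be detected.

```r
# Hypothetical helper: TRUE for paths that start with \\ or //
is_network_path <- function(paths) {
  startsWith(paths, "\\\\") | startsWith(paths, "//")
}

# Check every library R currently knows about
libs <- .libPaths()
data.frame(library = libs, on_network = is_network_path(libs))

# Made-up example paths
example_flags <- is_network_path(c("\\\\server\\share\\R\\library", "C:/R/library"))
example_flags
```

The same helper can be applied to the output of find.package("rmarkdown") to check where that package lives.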
Pandoc Error 61
For example: “Error: pandoc document conversion failed with error 61”
“Could not fetch…”
LaTex error (see below)
“! Package pdftex.def Error: File `cict_qm2_2020-06-29_files/figure-latex/unnamed-chunk-5-1.png’ not found: using draft setting.”
“Error: LaTeX failed to compile file_name.tex.”
See https://yihui.org/tinytex/r/#debugging for debugging tips.
See file_name.log for more info.
Pandoc Error 127
This could be a memory (RAM) issue. Re-start your R session and try again.
Mapping network drives
How does one open a file “through a mapped network drive”?
ISSUES WITH HAVING A SHARED LIBRARY LOCATION ON NETWORK DRIVE
Error in install.packages()
Try removing… /../…/00LOCK (directory)
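Leftover lock folders can be located and removed from R. The sketch below demonstrates this on a throw-away folder with a fake lock directory; the helper name is hypothetical. Only remove lock folders from a real library when you are sure no installation is currently running.

```r
# Hypothetical helper: list entries in a library whose names start with 00LOCK
find_lock_dirs <- function(lib) {
  list.files(lib, pattern = "^00LOCK", full.names = TRUE)
}

# Throw-away library with a fake lock folder, for demonstration only
demo_lib <- file.path(tempdir(), "demo_library")
dir.create(file.path(demo_lib, "00LOCK-dplyr"), recursive = TRUE)

locks <- find_lock_dirs(demo_lib)
locks

unlink(locks, recursive = TRUE)  # delete the lock folder(s)
remaining <- find_lock_dirs(demo_lib)

# For a real library you might start with: find_lock_dirs(.libPaths()[1])
```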
Saving files, deleting files, creating folders, interacting with files in a folder, etc Overwriting files in Excel
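As a starting point, these tasks can be sketched with base R functions. The example works in a temporary folder with a made-up file so it is safe to run; the folder and file names are hypothetical.

```r
# Create a folder (no warning if it already exists)
work_dir <- file.path(tempdir(), "file_demo")
dir.create(work_dir, showWarnings = FALSE)

# Save a file into it
csv_path <- file.path(work_dir, "cases.csv")
write.csv(data.frame(id = 1:3), csv_path, row.names = FALSE)

# Check that it exists and list the folder contents
file.exists(csv_path)
list.files(work_dir)

# Copy it (overwriting any earlier copy), then delete the original
file.copy(csv_path, file.path(work_dir, "cases_backup.csv"), overwrite = TRUE)
file.remove(csv_path)

remaining <- list.files(work_dir)
remaining
```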